Tel: +43 650 261 08 74

# Blog

## A few interesting distances

Today I read about a few distance measures from Chapter 3 Finding Similar Items from the Book "Minning of Massive Datasets". You can find the book and more information about everything here: http://infolab.stanford.edu/~ullman/mmds.html

Following is a little summary of the basics of the different distance measures.

Euclidian Distance

The Euclidian distance is the most familiar distance measure and is the one people normally think of as “distance”. An n-dimensional Euclidian space is one where points are vectors of n real numbers. This distance is also the L2-norm.

http://en.wikipedia.org/wiki/Euclidean_distance

Jaccard Distance

The Jaccard Distance is the opposite of the Jaccard Similarity, thus the Jaccard Distance is defined like following: d(x,y) = 1- SIM(x,y). Recall the Jaccard Similarity of sets S and T is the ratio of the size of the intersection of S and T to the size of their union.

Jaccard Simularity: SIM(x,y)

Jaccard Distance d(x,y) = 1 – SIM (x,y)

http://en.wikipedia.org/wiki/Jaccard_index

Cosine Distance

The cosine distance makes sense in spaces that have dimensions, including normal Euclidian spaces and also discrete versions of Euclidean spaces, such as spaces where points are normal vectors with integer components or simple boolean (like 0 or 1) components. In such a space, points may be thought of as directions. We do not distinguish between a vector and a multiple of that specific vector.

The cosine distance between two points is the angle that the vectors to those points make. This angle is in the range 0 to 180 degrees, regardless of how many dimensions the given space has.

The cosine distance is calculable like following:

Vector x and vector y is given, than the cosine distance between these two is the dot product x*y divided by the L2-Norms of x and y (Recall L2-Norm => Euclidian Distance – see above).

Edit Distance

Another distance measure is the Edit Distance. This distance mainly makes sense when the given points are strings. The distance between string x and string y is the smallest number of insertions and deletions of single characters to convert the first string into the second.

Another way to define and calculate the Edit Distance d(x,y) ist to compute the LCS (= longest common subsequence) of x and y.  The length of x plus the length of y minus twice the length of the LCS is also the Edit distance.

http://en.wikipedia.org/wiki/Edit_distance

Hamming Distance

The Hamming Distance makes sense in a space of vectors. The Hamming Distance could be defined as the number of components of two vectors, which differ from each other. Most commonly the Hamming Distance is used when the vectors are Boolean and only consist of 0´s and 1´s.

I know that are really just the basics, but for more information simple check out Wikipedia ore the original source (http://infolab.stanford.edu/~ullman/mmds.html)

21.01.2014

### Über den Autor Matthias Frick, MSc.

Er ist ein langjähriger Ruby-on-Rails Entwickler und leitet das Unternehmen Frick-Web.