mahout-dev mailing list archives

From Jake Mannix <jake.man...@gmail.com>
Subject Re: Cosine and Tanimoto Similarity
Date Sat, 26 Dec 2009 23:04:45 GMT
On Sat, Dec 26, 2009 at 2:47 PM, Ted Dunning <ted.dunning@gmail.com> wrote:

> One minor additional point is that you might want to use (1-cos)/2 in order
> to get a result in [0,1].
>

For distance, yeah, this can be fine, but for vectors which can have
negative components, I don't like doing that with similarity (where
'that' means "forcing the range to be [0,1]"), because then "perfect
similarity" is 1 (good so far), "perfect dissimilarity" (aka
anticorrelated/antiparallel) is 0 (still good), but two randomly
chosen vectors will have "similarity" 0.5, which seems weird to me.

I far prefer the set of vectors which are uncorrelated with a given
vector to have similarity with it clustered around zero, because that
makes intuitive sense to me.  Distance is different: living in a
compact space makes distance kinda weird, since there is a maximum
value, so scaling that to 1 is fine, and it says that in general, two
points chosen at random are distance 0.5 away from each other.
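
To see the 0.5 thing concretely, here's a tiny self-contained sketch
(plain Java, class and variable names made up for illustration,
nothing Mahout-specific): two independent random vectors in high
dimension have cosine near zero, so the similarity rescaled into
[0,1] lands right around 0.5.

import java.util.Random;

public class RandomCosine {
  public static void main(String[] args) {
    int dim = 10000;
    Random rnd = new Random(42);
    double[] a = new double[dim];
    double[] b = new double[dim];
    for (int i = 0; i < dim; i++) {
      a[i] = rnd.nextGaussian();
      b[i] = rnd.nextGaussian();
    }
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < dim; i++) {
      dot += a[i] * b[i];
      na += a[i] * a[i];
      nb += b[i] * b[i];
    }
    double cos = dot / Math.sqrt(na * nb);
    // cos is ~0 for independent random vectors, so the [0,1]-rescaled
    // similarity sits right around 0.5
    System.out.println("cos           = " + cos);
    System.out.println("(1 + cos) / 2 = " + (1 + cos) / 2);
  }
}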

I guess it depends on how you look at it.

  -jake


> On Sat, Dec 26, 2009 at 1:32 PM, Jake Mannix <jake.mannix@gmail.com>
> wrote:
>
> > On Sat, Dec 26, 2009 at 12:18 PM, Ted Dunning <ted.dunning@gmail.com>
> > wrote:
> >
> > > These are fine as distance measures.  It is also common to use
> > > sqrt(1-cos^2), which is more like an angle, but 1-cos is good
> > > enough for almost anything.
> > >
> > > With normal text, btw, all of the coordinates are positive, so
> > > the largest possible angle is pi/2 (cos = 0, sin = 1).
> > >
> >
> > I guess what I was saying is that if you take a less "normal"
> > representation of text (a random projection, say, or a projection
> > onto the SVD, etc.), you can get negative similarities which make
> > sense, and in this case you have similarity == 1 for perfect
> > alignment, 0 for uncorrelated, and -1 for anti-parallel, and you
> > definitely *want* -1, not +1.
> >
> > Going with sqrt(1-cos^2) = sin(theta) ~=~ theta is only good for
> > small angles - for large angles this isn't so great anymore, and
> > once the angle goes over pi/2 it's actually no longer monotonic and
> > is doing most certainly the wrong thing, which is why I usually
> > stick with 1-cos for distance if I'm not measuring similarity.
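> >
> > A quick numeric check of that (throwaway snippet, class name is
> > just illustrative):
> >
> > import static java.lang.Math.*;
> >
> > public class AngleCheck {
> >   public static void main(String[] args) {
> >     // sin(theta) tracks theta only for small angles and turns back
> >     // down after pi/2; 1 - cos(theta) keeps increasing all the way
> >     // out to pi.
> >     for (double theta = 0; theta <= PI + 1e-9; theta += PI / 8) {
> >       System.out.printf("theta=%.3f  sin=%.3f  1-cos=%.3f%n",
> >           theta, sin(theta), 1 - cos(theta));
> >     }
> >   }
> > }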
> >
> > I guess my question to you, Robin, is why would you take the abs?
> > If the data is text, then yes, in a normal representation your
> > coefficients are always positive, so all cosines are nonnegative
> > and there's no need to take the abs, right?
> >
> > The only case where I'd imagine wanting to consider anti-parallel
> > to be basically the same as parallel is the collaborative filtering
> > case, where, as we've discussed on this list in the past, sometimes
> > a negative rating is as much a measure of similarity as a positive
> > one: a user who rates everything exactly opposite to you is just as
> > predictive of your taste as one who agrees with you.  So if you've
> > mean-centered your ratings, then you do want dot products which
> > effectively take the abs as well.
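> >
> > To make that concrete with a toy example (ratings made up purely
> > for illustration):
> >
> > public class MeanCenteredCosine {
> >   static double[] center(double[] v) {
> >     double mean = 0;
> >     for (double x : v) mean += x;
> >     mean /= v.length;
> >     double[] c = new double[v.length];
> >     for (int i = 0; i < v.length; i++) c[i] = v[i] - mean;
> >     return c;
> >   }
> >   public static void main(String[] args) {
> >     // two users who rate the same items in exactly opposite ways
> >     double[] a = center(new double[] {5, 1, 5, 1});  // (2,-2,2,-2)
> >     double[] b = center(new double[] {1, 5, 1, 5});  // (-2,2,-2,2)
> >     double dot = 0, na = 0, nb = 0;
> >     for (int i = 0; i < a.length; i++) {
> >       dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
> >     }
> >     double cos = dot / Math.sqrt(na * nb);
> >     // cos == -1: perfectly anticorrelated, yet each user's ratings
> >     // predict the other's perfectly - the case where |cos| makes
> >     // sense as a similarity
> >     System.out.println("cos = " + cos + ", |cos| = " + Math.abs(cos));
> >   }
> > }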
> >
> > I'd say that is the exception, not the norm, however.
> >
> >  -jake
> >
> >
> > >
> > > On Sat, Dec 26, 2009 at 10:53 AM, Robin Anil <robin.anil@gmail.com>
> > > wrote:
> > >
> > > > I ran the Cosine and Tanimoto distance measures (d = 1 -
> > > > similarity) on the following vector pairs:
> > > >
> > > > (-1, -1) and (3, 3):  Cosine: 2.0                 Tanimoto: 1.2307692307692308
> > > > (1, 1)   and (3, 3):  Cosine: 0.0                 Tanimoto: 0.5714285714285714
> > > > (1, 8)   and (8, 1):  Cosine: 0.7538461538461538  Tanimoto: 0.8596491228070176
> > > >
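> > > > For reference, a minimal plain-Java sketch of the two formulas
> > > > (not the actual Mahout DistanceMeasure classes) that reproduces
> > > > those numbers:
> > > >
> > > > public class DistanceCheck {
> > > >   static double dot(double[] a, double[] b) {
> > > >     double d = 0;
> > > >     for (int i = 0; i < a.length; i++) d += a[i] * b[i];
> > > >     return d;
> > > >   }
> > > >   // cosine distance: 1 - a.b / (|a| |b|)
> > > >   static double cosine(double[] a, double[] b) {
> > > >     return 1 - dot(a, b) / Math.sqrt(dot(a, a) * dot(b, b));
> > > >   }
> > > >   // Tanimoto distance: 1 - a.b / (|a|^2 + |b|^2 - a.b)
> > > >   static double tanimoto(double[] a, double[] b) {
> > > >     double ab = dot(a, b);
> > > >     return 1 - ab / (dot(a, a) + dot(b, b) - ab);
> > > >   }
> > > >   public static void main(String[] args) {
> > > >     double[][][] pairs = {
> > > >       {{-1, -1}, {3, 3}}, {{1, 1}, {3, 3}}, {{1, 8}, {8, 1}}
> > > >     };
> > > >     for (double[][] p : pairs) {
> > > >       System.out.println("Cosine: " + cosine(p[0], p[1])
> > > >           + "  Tanimoto: " + tanimoto(p[0], p[1]));
> > > >     }
> > > >   }
> > > > }
> > > >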
> > > > How should anti-parallel vectors be treated in the Mahout
> > > > clustering packages?  Is it acceptable to return 2.0 for
> > > > anti-parallel vectors and 1.0 for perpendicular ones?  For text
> > > > data the vectors are all positive, but if scientific data is
> > > > being clustered, what should the default behaviour be, given
> > > > that clustering is always trying to find a configuration where
> > > > the distances are at a minimum?  Since I have dealt mostly with
> > > > text data, I would always take the abs value of the cosine
> > > > similarity before subtracting it from 1.0.  Has any of you
> > > > encountered such a situation wrt some particular dataset?
> > > > Robin
> > > >
> > >
> > >
> > >
> > > --
> > > Ted Dunning, CTO
> > > DeepDyve
> > >
> >
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>
