mahout-user mailing list archives

From dipesh <dipshres...@gmail.com>
Subject Re: Text clustering
Date Mon, 08 Dec 2008 05:06:18 GMT
Yes, Richard is right. I took the arccosine of the value and that resolved the
mismatch:
Math.acos(value), which ranges from 0 to π / 2 for cosine values between 0 and 1.
"...π / 2 meaning independent, 0 meaning exactly the same, with in-between
values indicating intermediate similarities or dissimilarities...."
--wiki<http://en.wikipedia.org/w/index.php?title=Jaccard_index&section=2#Tanimoto_coefficient_.28extended_Jaccard_coefficient.29>
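For illustration, the arccosine step could be sketched roughly as below. This is only a minimal sketch, not Mahout code: the class name AngularDistance and the methods cosine and angle are made up for this example, and it assumes dense double[] vectors of equal length.

```java
// Hypothetical helper for illustration only; not part of Mahout.
final class AngularDistance {

    // Cosine of the angle between two equal-length, non-zero vectors.
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            throw new IllegalArgumentException("cosine is undefined for a zero vector");
        }
        return dot / Math.sqrt(normA * normB);
    }

    // Angle in radians: 0 for identical directions, PI / 2 for orthogonal vectors.
    static double angle(double[] a, double[] b) {
        // Clamp to [-1, 1] so floating-point rounding cannot push acos into NaN.
        double c = Math.max(-1.0, Math.min(1.0, cosine(a, b)));
        return Math.acos(c);
    }

    public static void main(String[] args) {
        double[] x = {1.0, 0.0, 2.0};
        double[] y = {0.0, 3.0, 0.0};
        System.out.println("angle(x, x) = " + angle(x, x));
        System.out.println("angle(x, y) = " + angle(x, y));
    }
}
```

With non-negative TF-IDF weights the cosine cannot go below 0, so the angle stays within 0 to π / 2.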

I think Tanimoto distance is better suited to binary values, but with TF-IDF
we have values other than 0s and 1s.
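To make the binary case concrete, here is a rough sketch of the extended Jaccard (Tanimoto) coefficient as given on the wiki page linked above, T = a·b / (|a|² + |b|² − a·b). The class and method names are invented for this example and it is not taken from Mahout. On 0/1 vectors it reduces to intersection over union; on TF-IDF weights it still computes a number, but that set reading no longer applies.

```java
// Hypothetical helper for illustration only; not part of Mahout.
final class TanimotoSketch {

    // Extended Jaccard / Tanimoto coefficient: a.b / (|a|^2 + |b|^2 - a.b).
    static double coefficient(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0, sqA = 0.0, sqB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            sqA += a[i] * a[i];
            sqB += b[i] * b[i];
        }
        double denominator = sqA + sqB - dot;
        if (denominator == 0.0) {
            throw new IllegalArgumentException("coefficient is undefined for two zero vectors");
        }
        return dot / denominator;
    }

    // One common way to turn the coefficient into a distance.
    static double distance(double[] a, double[] b) {
        return 1.0 - coefficient(a, b);
    }

    public static void main(String[] args) {
        // Binary vectors: one shared term, three terms in the union.
        double[] docA = {1.0, 1.0, 0.0};
        double[] docB = {1.0, 0.0, 1.0};
        System.out.println("binary coefficient = " + coefficient(docA, docB));

        // Made-up TF-IDF style weights, only to show that real values are accepted.
        double[] tfidfA = {0.5, 1.2, 0.0};
        double[] tfidfB = {0.4, 0.0, 2.0};
        System.out.println("weighted distance = " + distance(tfidfA, tfidfB));
    }
}
```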

The Pearson correlation that Sean suggested is equivalent to the cosine
measure only if the data are 'centered' (have a mean of 0). But as Richard
said, TF-IDF vectors aren't going to contain any negative values, so they
can't have a mean of 0.
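A small sketch of that relationship, with invented names and made-up numbers (this is not the Mahout collaborative filtering code): Pearson is computed here as the cosine of the two vectors after subtracting each vector's own mean, so on uncentered, non-negative data the two measures can disagree sharply.

```java
// Hypothetical helper for illustration only; not part of Mahout.
final class PearsonVsCosine {

    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            throw new IllegalArgumentException("cosine is undefined for a zero vector");
        }
        return dot / Math.sqrt(normA * normB);
    }

    // Returns a copy of v shifted so that its components have a mean of 0.
    static double[] center(double[] v) {
        double sum = 0.0;
        for (double x : v) {
            sum += x;
        }
        double mean = sum / v.length;
        double[] centered = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            centered[i] = v[i] - mean;
        }
        return centered;
    }

    // Pearson correlation expressed as the cosine of the centered vectors.
    static double pearson(double[] a, double[] b) {
        return cosine(center(a), center(b));
    }

    public static void main(String[] args) {
        // Non-negative, uncentered values, as TF-IDF weights would be.
        double[] x = {1.0, 2.0, 3.0};
        double[] y = {3.0, 2.0, 1.0};
        System.out.println("cosine  = " + cosine(x, y));   // positive on non-negative data
        System.out.println("pearson = " + pearson(x, y));  // negative: the trends are opposite
    }
}
```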

Regards,
Dipesh


>
> 2008/12/6 Sean Owen <srowen@gmail.com>
>
> > To answer a few recent points:
> >
> > Not sure if this is helpful, but, the collaborative filtering part of
> > Mahout contains an implementation of cosine distance measure -- sort
> > of. Really it has an implementation of the Pearson correlation, which
> > is equivalent, if the data are 'centered' (have a mean of 0). This is,
> > in my opinion, a good idea. So if you agree, you could copy and adapt
> > this implementation of Pearson to your purpose. It is pretty easy to
> > re-create the actual cosine distance measure correlation too from this
> > code -- I used to have it separately in the code.
> >
> > The Tanimoto distance is a ratio of intersection to union of two sets,
> > so is between 0 and 1. Cosine distance is, essentially, the cosine of
> > an angle in feature-space, so is between -1 and 1.
> >
>



-- 
----------------------------------------
"Help Ever Hurt Never"- Baba