mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Cluster distance
Date Thu, 07 Jan 2010 18:45:18 GMT
The best rule is to try several cases.  The L-1 and L-2 distances, each
with and without normalization, are the most important cases.
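
As a minimal sketch of what trying those cases could look like, the
following computes the L-1 and L-2 distances between two term-weight
vectors, with and without normalization.  The function names and the
sample vectors are illustrative assumptions, not Mahout APIs, and the
choice to normalize with the same norm as the distance is also an
assumption.

```python
import math


def l1_distance(a, b):
    """Manhattan (L-1) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def l2_distance(a, b):
    """Euclidean (L-2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def normalize(v, p):
    """Scale v to unit L-p norm; a zero vector is returned unchanged."""
    norm = sum(abs(x) ** p for x in v) ** (1.0 / p)
    return list(v) if norm == 0 else [x / norm for x in v]


def distance_cases(a, b):
    """Return the four cases: L-1 and L-2, raw and normalized."""
    cases = {}
    for name, dist, p in (("L-1", l1_distance, 1), ("L-2", l2_distance, 2)):
        cases[name + " raw"] = dist(a, b)
        cases[name + " normalized"] = dist(normalize(a, p), normalize(b, p))
    return cases


# Hypothetical documents: doc_b has the same term mix as doc_a at twice the length.
doc_a = [3.0, 0.0, 4.0]
doc_b = [6.0, 0.0, 8.0]
for name, value in distance_cases(doc_a, doc_b).items():
    print(name, round(value, 6))
```

The two sample documents differ only in length, so the raw distances
are non-zero while the normalized ones collapse to zero.  That is the
kind of difference worth checking against your own data.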

K-means clustering assumes that you have already done any term
weighting.  You should experiment a little there as well, but the
standard IDF measure is probably fine.  The only question is whether you
should limit the weight of singleton terms somewhat.  With large corpora,
that is less critical.  Also, if you don't use L-2 normalization, then what
you do with very rare terms will matter much less, since they probably won't
ever match anything and thus won't contribute to dot products.
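
A small sketch may make the point about rare terms concrete.  It
assumes the common log(N / df) form of IDF, a hypothetical max_weight
cap as one way to limit singleton terms, and made-up document
frequencies; none of these names come from Mahout.

```python
import math


def idf_weights(doc_freq, num_docs, max_weight=None):
    """IDF as log(N / df), optionally capped at max_weight (an assumed scheme)."""
    weights = {}
    for term, df in doc_freq.items():
        w = math.log(num_docs / df)
        if max_weight is not None:
            w = min(w, max_weight)
        weights[term] = w
    return weights


def weight_doc(term_counts, idf):
    """Multiply raw term counts by their IDF weights."""
    return {t: c * idf[t] for t, c in term_counts.items()}


def dot(a, b):
    """Dot product of two sparse vectors; only shared terms contribute."""
    return sum(w * b[t] for t, w in a.items() if t in b)


def l2_normalize(v):
    """Scale a sparse vector to unit L-2 norm."""
    norm = math.sqrt(sum(w * w for w in v.values()))
    return dict(v) if norm == 0 else {t: w / norm for t, w in v.items()}


# Hypothetical corpus statistics: "xyzzy" occurs in a single document.
num_docs = 1000
doc_freq = {"cluster": 100, "distance": 50, "xyzzy": 1}
doc1 = {"cluster": 1, "distance": 1, "xyzzy": 1}
doc2 = {"cluster": 1, "distance": 1}

results = {}
for label, cap in (("uncapped", None), ("capped", 3.0)):
    idf = idf_weights(doc_freq, num_docs, max_weight=cap)
    v1, v2 = weight_doc(doc1, idf), weight_doc(doc2, idf)
    raw = dot(v1, v2)
    cosine = dot(l2_normalize(v1), l2_normalize(v2))
    results[label] = (raw, cosine)
    print(label, "raw dot:", raw, "normalized dot:", cosine)
```

The rare term never matches, so the raw dot product is the same with
or without the cap.  After L-2 normalization its weight inflates the
norm of the first document, so the cap changes the normalized dot
product.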

On Thu, Jan 7, 2010 at 5:20 AM, Grant Ingersoll <gsingers@apache.org> wrote:

> I'm sure others can chime in w/ more of their experience.




-- 
Ted Dunning, CTO
DeepDyve
