mahout-user mailing list archives

From Shashikant Kore <shashik...@gmail.com>
Subject Re: Clustering techniques, tips and tricks
Date Sat, 02 Jan 2010 07:15:21 GMT
On Thu, Dec 31, 2009 at 10:40 PM, Grant Ingersoll <gsingers@apache.org> wrote:
>
> The other thing I'm interested in is people's real world feedback on using clustering
> to solve their text related problems.
> For instance, what type of feature reduction did you do (stopword removal, stemming,
> etc.)?  What algorithms worked for you?
> What didn't work?  Any and all insight is welcome and I don't particularly care if it
> is Mahout specific (for instance, part of
> the chapter is about search result clustering using Carrot2 and so Mahout isn't applicable)
>

Let me start by saying Mahout works great for us. We can run k-means
on 250k docs (10 iterations, 100 seeds) in less than 30 minutes on a
single host.

Normalizing the vectors, for example with the L2 norm, helped quite a
bit. Thanks to Ted for this suggestion. In text clustering, you have
lots of small documents. This results in very sparse vectors (a total
of 100K features, with each vector having 200 features). Vanilla TFIDF
weights on their own don't work as nicely.
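
To illustrate what L2 normalization does to a sparse vector, here is a
minimal sketch in plain Python. This is not Mahout code; the function
name and the toy weights are made up for the example. Each vector is
scaled to unit Euclidean length, so a short and a long document with
the same term proportions end up with the same vector.

```python
import math


def l2_normalize(vec):
    """Scale a sparse vector (dict of feature -> weight) to unit L2 length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    if norm == 0.0:
        # An all-zero vector cannot be scaled; return a copy unchanged.
        return dict(vec)
    return {feature: w / norm for feature, w in vec.items()}


# Hypothetical TFIDF weights: same proportions, different document lengths.
short_doc = {"mahout": 2.0, "cluster": 1.0}
long_doc = {"mahout": 20.0, "cluster": 10.0}

a = l2_normalize(short_doc)
b = l2_normalize(long_doc)
```

After normalization, distance-based algorithms such as k-means compare
documents by the direction of their vectors rather than by their length.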

Even if we don't do explicit stop word removal, thresholds on
document count do that job in a better fashion. If you exclude the
features which are extremely common (say, in more than 40% of documents)
or extremely rare (say, in fewer than 50 documents in a corpus of 100K
docs), you have a meaningful set of features. The current K-Means
already accepts these parameters.
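
The document-count thresholds can be sketched as follows. This is an
illustration in plain Python, not the Mahout implementation; the
function and parameter names (prune_by_doc_freq, min_df,
max_df_fraction) are hypothetical. The defaults mirror the numbers
above (at least 50 documents, at most 40% of the corpus), and the toy
call uses smaller values so the effect is visible on five documents.

```python
from collections import Counter


def prune_by_doc_freq(docs, min_df=50, max_df_fraction=0.40):
    """Return the set of terms whose document frequency is within bounds.

    docs is a list of token lists. A term is kept if it appears in at
    least min_df documents and in at most max_df_fraction of all documents.
    """
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    upper = max_df_fraction * len(docs)
    return {term for term, n in df.items() if min_df <= n <= upper}


# Toy corpus: "the" is in every document, "fish" and "bird" in only one.
docs = [
    ["the", "cat"],
    ["the", "dog"],
    ["the", "cat", "fish"],
    ["the", "dog"],
    ["the", "bird"],
]
kept = prune_by_doc_freq(docs, min_df=2, max_df_fraction=0.6)
```

Here "the" is dropped as too common and "fish" and "bird" as too rare,
which leaves "cat" and "dog" without any stop word list.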

Stemming can be used for feature reduction, but it has a minor issue.
When you want to find out the prominent features of the resulting
cluster centroid, what you get back is the stem, which may not be a
meaningful word. For example, if a prominent feature is "beautiful",
you will get back "beauti." Ouch.
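
One common workaround, sketched here as an assumption rather than
something we did, is to remember the most frequent surface form seen
for each stem and show that when labeling a centroid. The toy_stem
function below is a made-up stand-in for a real stemmer such as Porter;
all names are hypothetical.

```python
from collections import Counter, defaultdict


def toy_stem(word):
    """Crude suffix stripper, standing in for a real stemmer."""
    for suffix in ("ful", "es", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


def build_stem_lookup(tokens, stem=toy_stem):
    """Map each stem to the most frequent original word that produced it."""
    forms = defaultdict(Counter)
    for token in tokens:
        forms[stem(token)][token] += 1
    return {s: counts.most_common(1)[0][0] for s, counts in forms.items()}


tokens = ["beautiful", "beautiful", "beauties",
          "cluster", "clusters", "clusters"]
lookup = build_stem_lookup(tokens)
```

With this lookup, the centroid feature "beauti" can be displayed as
"beautiful" while the clustering itself still runs on stems.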

I tried fuzzy K-Means for soft clustering, but I didn't get good
results. Maybe the corpus was the issue.

One observation about the clustering process is that it is geared, by
accident or by design, towards batch processing. There is no
support for real-time clustering. There needs to be glue which ties
all the components together to make the process seamless. I suppose
someone in need of this feature will contribute it to Mahout.

Grant, if I recall anything more, I will mail it to you.

--shashi
