mahout-user mailing list archives

From Bogdan Vatkov
Subject Re: Stopwords work for Solr but not for Mahout
Date Sat, 02 Jan 2010 18:27:17 GMT
Thanks for the Luke hint, I will try it out. In the meantime I noticed
something else that is very strange: I ran k-means on 23K+ docs with 50
clusters, and the top-term lists of all of them look odd. I would say that
for 90% of the top terms I get words which I barely recognize.
I did a quick check on one particular term, which sounded strange anyway
and which appeared in the top terms of 9 of the 50 clusters, and I found
that it has "doc freq" = 2 in the Solr dictionary.
How is this even possible? With 23,000 docs, how can a term whose doc freq
is only 2 end up as a top term in 9 clusters? I definitely
did something wrong. Do you have any idea what that could be?
