mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Why are clustering emails not clustering similar stuff?
Date Sat, 08 Jun 2013 13:25:08 GMT
How are you verifying your vectorization?

What do you use for weighting of words?

Have you tested the distance between the notifications and other documents?  Are near-duplicate
documents close to each other?
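
One way to run that kind of check outside Mahout is sketched below. It is only an illustration under stated assumptions: the three sample documents are made up, tokenization is a plain whitespace split, and the weighting is the textbook tf * ln(N / df) formula, which is not necessarily what seq2sparse computes. The helper names tfidf_vectors and cosine_distance are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using tf * ln(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def cosine_distance(a, b):
    """1 - cosine similarity of two sparse vectors; 1.0 if either is all-zero."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


# Made-up sample documents: two near-duplicate notifications and one unrelated mail.
docs = [
    "facebook notification james doe commented on your photo",
    "facebook notification mary roe commented on your photo",
    "quarterly budget meeting moved to friday afternoon",
]
vecs = tfidf_vectors(docs)
d_similar = cosine_distance(vecs[0], vecs[1])
d_unrelated = cosine_distance(vecs[0], vecs[2])
print(f"notification vs notification: {d_similar:.3f}")
print(f"notification vs unrelated:    {d_unrelated:.3f}")
```

In this toy corpus the names occur in only one document each, so they get a higher weight than the template words the two notifications share. If the same pattern holds in the real vectors, the names would pull otherwise near-identical notifications apart, which is worth ruling out before tuning k.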


On Jun 6, 2013, at 7:47, Jesvin Jose <frank.einstien@gmail.com> wrote:

> I tried to cluster 1000 of one person's emails using k-means, but the
> clusters are not forming well. For example, if Facebook sends notifications
> about James Doe and 5 other people, I get 5 clusters like:
> 
> :VL-858{n=7
>    Top Terms:
>        doe                                   =>  10.066998481750488
>        james                                =>  10.066998481750488
> 
> Why are the notifications for all 5 people not getting clustered together? I
> used variants of the commands from Mahout in Action (Sean Owen et al.), as
> follows:
> 
> Vectorizing uses lowercasing, stop-word removal and a length filter:
> 
> bin/hadoop jar
> /home/jesvin/dev/hadoop/mahout-distribution-0.7/examples/target/mahout-examples-0.7-job.jar
> org.apache.mahout.driver.MahoutDriver seq2sparse -i mymail-seqfiles -o
> mymail-vectors-bigram -ow  -a mia.clustering.ch10.MyAnalyzer -chunk 200 -wt
> tfidf -s 5 -md 3 -x 90 -ng 2 -ml 50 -seq
> 
> It's for 1000 emails, and I tried 100 clusters. If I try 50, I still get
> similar results, but only half the number of emails "get into" any cluster.
> 
> bin/hadoop jar
> /home/jesvin/dev/hadoop/mahout-distribution-0.7/examples/target/mahout-examples-0.7-job.jar
> org.apache.mahout.driver.MahoutDriver kmeans -i
> mymail-vectors-bigram/tfidf-vectors -c mymail-initial-clusters -o
> mymail-kmeans-clusters-from-bigrams -dm
> org.apache.mahout.common.distance.CosineDistanceMeasure -cd 0.1 -k 100 -x
> 20 -cl
> 
> -- 
> We dont beat the reaper by living longer. We beat the reaper by living well
> and living fully. The reaper will come for all of us. Question is, what do
> we do between the time we are born and the time he shows up? -Randy Pausch
