mahout-user mailing list archives

From Jeff Eastman <jeast...@Narus.com>
Subject RE: Dirichlet Process Clustering not working
Date Wed, 19 Oct 2011 16:04:21 GMT
I agree something is amiss here, but it could be that the model is just not suitable for this problem.
Running with the Reuters dataset, I see all the points being assigned to C-0 in the very first
iteration, as you do. I think the problem is with the pdf() calculations in the mapper for
very wide vectors such as the ones we are using. For lower-dimensional vectors, DPC appears to be
working well.
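
To illustrate the kind of failure I suspect in pdf() for wide vectors, here is a minimal
sketch. It is not Mahout's code: the function names, the unit-variance normal model and the
two cluster centers are all assumptions made for the example, and whether Mahout's pdf()
actually fails this way is still only a hypothesis. The sketch shows that a density computed
as a product over 100,000 dimensions can underflow to 0.0 for every cluster, while the same
quantity accumulated in log space stays finite and can still be normalized.

```python
import math

def naive_pdf(x, mean, sd=1.0):
    """Product of independent per-dimension normal densities (may underflow)."""
    p = 1.0
    for xi, mi in zip(x, mean):
        z = (xi - mi) / sd
        p *= math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))
    return p

def log_pdf(x, mean, sd=1.0):
    """Same density, accumulated as a sum of logs so it stays finite."""
    total = 0.0
    for xi, mi in zip(x, mean):
        z = (xi - mi) / sd
        total += -0.5 * z * z - math.log(sd * math.sqrt(2.0 * math.pi))
    return total

def normalize_log(log_ps):
    """Turn log densities into probabilities using the log-sum-exp trick."""
    m = max(log_ps)
    weights = [math.exp(lp - m) for lp in log_ps]
    s = sum(weights)
    return [w / s for w in weights]

dims = 100_000                          # roughly the width of the vectors in question
x = [0.0] * dims                        # hypothetical data point
means = [[0.0] * dims, [0.01] * dims]   # two hypothetical cluster centers

naive = [naive_pdf(x, m) for m in means]   # both values underflow to 0.0
logs = [log_pdf(x, m) for m in means]      # large negative but finite
probs = normalize_log(logs)                # still distinguishes the clusters

print(naive, logs, probs)
```

If every cluster's density underflows to zero, the relative weights carry no information
any more, so the assignment depends entirely on how the zeros are handled downstream.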

I'm going to commit the build-reuters.sh enhancements I've added for FuzzyK and DPC so we
can both use the same platform. I will report more progress as I dig in deeper today...

-----Original Message-----
From: edward choi [mailto:mp2893@gmail.com] 
Sent: Wednesday, October 19, 2011 8:11 AM
To: user@mahout.apache.org
Subject: Re: Dirichlet Process Clustering not working

Okay, I've just tried DPC with the Reuters document set.
I let 'build-reuters.sh' create the sequence files and vectors. (Judging
from the dictionary generated by Mahout, the number of features
seemed to be less than 100,000.)
Then I used them to run DPC (15 clusters, 10 iterations, alpha 1.0,
clustering true, no additional options).
Below is the result of the clusterdump of clusters-10:
----------------------------------------------------------------------------------------------------------------------------
C-0: GC:0{n=15745 c=[0:0.026, 0.003:0.001, 0.01:0.004, 0.02:0.002,
0.05:0.004, 0.07:0.005, 0.07
    Top Terms:
        said                                    =>  1.6577128281476725
        mln                                     =>  1.2455441154347937
        dlrs                                    =>  1.1173752482257673
        3                                       =>   1.042824193090437
        pct                                     =>  1.0223684722334667
        reuter                                  =>  0.9934255143959358
C-1: GC:1{n=0 c=[0:-0.595, 0.003:0.228, 0.01:-0.401, 0.02:-0.711,
0.05:1.840, 0.07:0.136, 0.077:-0.739, 0.1:-0.177, 0.10:
    Top Terms:....
C-10: GC:10{n=0 c=[0:0.090, 0.003:-1.426, 0.01:-0.472, 0.02:0.672,
0.05:0.800, 0.07:0.691, 0.077:1.037, 0.1:0
    Top Terms:....
C-11: GC:11{n=0 c=[0:-0.835, 0.003:-1.748, 0.01:-1.030, 0.02:-1.760,
0.05:-0.343, 0.07:0.286, 0.077:1.179,
    Top Terms:....
----------------------------------------------------------------------------------------------------------------------------
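
For anyone who wants to check the cluster sizes in a dump like the one above without
reading it by eye, here is a small sketch. It assumes the cluster header lines look exactly
like the ones in this excerpt ("C-<id>: GC:<id>{n=<count> c=[..."); the regular expression
and the function name are my own and are not part of Mahout.

```python
import re

# Matches header lines such as "C-0: GC:0{n=15745 c=[0:0.026, ..."
HEADER = re.compile(r"^C-(\d+): GC:\d+\{n=(\d+)")

def cluster_sizes(dump_text):
    """Return {cluster id: number of assigned points} from clusterdump text."""
    sizes = {}
    for line in dump_text.splitlines():
        m = HEADER.match(line)
        if m:
            sizes[int(m.group(1))] = int(m.group(2))
    return sizes

# Header lines copied (truncated) from the dump above
sample = """C-0: GC:0{n=15745 c=[0:0.026, 0.003:0.001,
    Top Terms:
C-1: GC:1{n=0 c=[0:-0.595, 0.003:0.228,
C-10: GC:10{n=0 c=[0:0.090, 0.003:-1.426,
C-11: GC:11{n=0 c=[0:-0.835, 0.003:-1.748,
"""

sizes = cluster_sizes(sample)
nonempty = [cid for cid, n in sizes.items() if n > 0]
print(sizes, nonempty)
```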
I guess the same thing happened again, so the document set is not the
problem. Something is definitely wrong with DPC.
Interestingly, the first cluster point does not have a single
negative value in it, while
the rest of the cluster points have a lot of negative values. So I guess this
phenomenon has something to do with the first cluster hogging all the
documents.
Any comments on this result?
(I haven't tried TestClusterDumper.testDirichlet2&3 yet. I'll post another
thread when I am done with that.)

Regards,
Ed


