mahout-user mailing list archives

From edward choi
Subject Re: Dirichlet Process Clustering not working
Date Wed, 19 Oct 2011 00:47:23 GMT
I did run DPC with alpha 1.0, with no luck. Then I tried alpha 2.0, still
no luck. I doubt that the problem is related to parameter settings.

I don't know the exact number, but I am pretty sure that the number of
features in my document vectors is easily over 150,000.
I wanted to use all numerical figures and all kinds of nouns and verbs. I
normalized the nouns and verbs, but there should still be more than 100,000
of them. I guess that is too large a number of features. (FYI, I set
maxDFPercent to 50 when making the vectors.)
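
To illustrate why the feature count might matter: Jeff's note, quoted below, says that pdf() forms the composite value as a product of the component pdfs. The following is a rough sketch in plain Python, not Mahout's code. The function names, the unit standard deviation, and the toy vectors are all assumptions made for illustration. It shows that a product of about 150,000 per-dimension Gaussian densities underflows to 0.0 in double precision for every cluster, while the equivalent sum of log densities still separates a near centre from a far one. Whether this is what actually happens inside my DPC run is untested.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-d normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def product_pdf(xs, means, std):
    """Composite density as a plain product of per-dimension densities."""
    p = 1.0
    for x, m in zip(xs, means):
        p *= gaussian_pdf(x, m, std)
    return p

def log_pdf(xs, means, std):
    """The same quantity in log space: a sum of per-dimension log densities."""
    log_norm = -math.log(std * math.sqrt(2.0 * math.pi))
    return sum(log_norm - 0.5 * ((x - m) / std) ** 2 for x, m in zip(xs, means))

n_features = 150_000        # roughly the vocabulary size mentioned above
doc = [0.0] * n_features    # hypothetical document vector
near = [0.0] * n_features   # hypothetical cluster centre equal to the document
far = [0.5] * n_features    # hypothetical cluster centre far from the document
std = 1.0                   # assumed unit standard deviation in every dimension

# Both products underflow to 0.0, so the two clusters cannot be told apart.
print(product_pdf(doc, near, std), product_pdf(doc, far, std))

# The log densities stay finite, and the near centre scores higher.
print(log_pdf(doc, near, std), log_pdf(doc, far, std))
```

If the real model behaves like the product version, every cluster would get the same zero likelihood for every document, so a custom model that works in log space might be one thing to try.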

I'll give TestClusterDumper.testDirichlet a try. And I should definitely
test with the Reuters document set as well, to see whether the results differ
from those on my document set.
Thanks for the advice. I'll post again when I'm done testing.


2011/10/19 Jeff Eastman

> Check out TestClusterDumper.testDirichlet2&3 for an example of text
> clustering using DPC. It produces reasonable looking clusters when compared
> with k-means and the other algorithms, but on a small vocabulary. Also check
> out DisplayDirichlet, which does a great job of clustering some random 2-d
> data.
> I'd suggest trying the default 1.0 alpha as is done in the cluster dumper
> tests. Also, the default model is GaussianCluster and it may not perform
> well with a large feature space. Check the pdf() function which uses the
> product of the component pdfs to produce the composite value for each
> cluster. This may not be optimal for really large term vectors. How many
> elements are in your term vectors? You may need to create your own model and
> model distribution to make DPC perform on your data.
> Jeff
> -----Original Message-----
> From: edward choi
> Sent: Tuesday, October 18, 2011 7:06 AM
> To:
> Subject: Dirichlet Process Clustering not working
> Hi,
> This is my first time using Mahout, though I have been working with Hadoop
> and HBase for over a year.
> I collected several hundred thousand news articles from RSS, and I wanted to
> run Dirichlet process clustering (DPC) on them.
> I followed the steps on the Mahout wiki: making sequence files from plain
> documents, converting them into vectors, running DPC, and finally running
> clusterdump.
> My DPC settings were: 20 clusters, 10 iterations, alpha 2.0, clustering true,
> emitMostLikely false. No modelDist, modelPrototype, or distanceMeasure was
> specified.
> The number of documents was 5896. (I preprocessed the docs so that they would
> contain only verbs and nouns.)
> The result was not what I had expected.
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> C-0: GC:0{n=5896 c=[0:0.001, 0.07:0.000, 0.08:0.000, 0
> .......................
>    Top Terms:
>        comment                                 =>0.015425061539023016
>        2011                                    =>0.011413068888273332
>        reserve                                 =>0.011253999429472274
>        rights                                  => 0.01115527360420605
>        use                                     =>0.010942002711960384
>        rights reserve                          =>0.010882667414113879
>        copyright                               =>0.010399572042096333
>        publish                                 =>0.009924242339732702
>        time                                    => 0.00988611270657134
>        material                                =>0.009849842593611612
> C-1: GC:1{n=0 c=[0:-0.239, 0.07:0.775, 0.08:-0.767,.....
>    Top Terms:.......
> C-10: GC:10{n=0 c=[0:-1.116, 0.07:-0............
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> This is what the clusterdump output looks like. To my understanding, this
> means that all the documents were assigned to a single cluster, namely C-0.
> I changed the DPC settings around. I also changed the vector-creation process
> a bit, but always got the same result.
> I was so out of ideas that I tried k-means with the exact same documents and
> vectors, and it worked! I don't know what to make of this.
> I searched Google but found no definite solution, so I guess everybody else
> is doing fine with DPC.
> Could someone please tell me what I am doing wrong? (Oh, and I am running
> Mahout in standalone mode.)
> Regards,
> Ed
