mahout-user mailing list archives

From Jeff Eastman <>
Subject Re: clustering your data with dirichlet issue
Date Tue, 06 Apr 2010 17:14:30 GMT
Toby Doig wrote:
> I've run dirichlet commandline and now have an output folder with some
> state-0, state-1, ... state-5 folders which each contain part-00000 and
> .part-00000.crc files. However the  ClusteringYourData wiki page's
> Retrieving the Output section just says TODO. I don't know how to turn those
> part files into something useful.
> I successfully ran
> the org.apache.mahout.clustering.syntheticcontrol.dirichlet.Job test which
> outputted data as text (to console at least) so I tried ripping the
> printResults() methods from that class and putting them
> in org.apache.mahout.clustering.dirichlet.DirichletJob but to no avail.
> Can someone help?
> Also, when running the commandline job it asks for the prototypeSize (-s
> param). When I converted my Lucene index to a vector file, the output said
> it created 11 vectors, but when I specified that value for prototypeSize the
> job failed, saying it found 1793 vectors. Changing the value I specify to
> 1793 works, but I now wonder why I need to specify it at all if the job can
> figure it out? Could it not be optional?
Hi Toby,

Each of the state-i directories contains a sequence file of the model 
states at the end of the i-th iteration. Since Dirichlet does not have a 
convergence criterion, it will run for as many iterations as you select. 
Interpreting the results is also complicated by the fact that points are 
not assigned uniquely to a model - as in k-means - or even with a 
probability - as in fuzzy k-means. Each model does retain the number of 
points that it captured in that iteration - not the points themselves - 
so it is possible to back-fit the points and see which were the most 
likely to have been captured: evaluate each point with the model's pdf() 
function and take the top n. Of course, that won't scale, but check out 
TestL1ModelClustering in utils/ for some code that I used.
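A minimal sketch of that back-fitting idea, not the actual Mahout API: here a hypothetical 1-D Gaussian pdf() stands in for a model's density function, and we simply score every point and keep the indices of the top-n most likely.

```java
import java.util.Arrays;

// Hedged sketch of back-fitting points to a model. The pdf() below is a
// hypothetical stand-in for a real model's density function; Mahout's
// Model interface differs, but the scoring/top-n step is the same idea.
public class BackFit {
    // Hypothetical 1-D Gaussian density standing in for a cluster model.
    static double pdf(double x, double mean, double stdDev) {
        double z = (x - mean) / stdDev;
        return Math.exp(-0.5 * z * z) / (stdDev * Math.sqrt(2 * Math.PI));
    }

    // Return the indices of the n points most likely under the model,
    // sorted by descending density.
    static int[] topN(double[] points, double mean, double stdDev, int n) {
        Integer[] idx = new Integer[points.length];
        for (int i = 0; i < points.length; i++) idx[i] = i;
        Arrays.sort(idx, (a, b) ->
                Double.compare(pdf(points[b], mean, stdDev),
                               pdf(points[a], mean, stdDev)));
        int[] out = new int[n];
        for (int i = 0; i < n; i++) out[i] = idx[i];
        return out;
    }

    public static void main(String[] args) {
        double[] points = {0.1, 5.0, 0.2, 9.9, -0.1};
        // Top-2 under a model centered at 0: the two points nearest the mean.
        System.out.println(Arrays.toString(topN(points, 0.0, 1.0, 2)));
    }
}
```

As noted above, scoring every point against every model is O(points × models) and won't scale; TestL1ModelClustering is the place to look for the real code.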

The ClusterDumper is not able to dump the Dirichlet clusters yet, though 
there is an open issue for this (MAHOUT-270) which is not completed. 
I'm working on it, and you are welcome to make suggestions. 
Currently I'm trying to refactor the term priorities and other parts of 
ClusterDumper to work with the Printable interface rather than relying 
upon ClusterBase.

The prototype and prototypeSize arguments give you a way to specify the 
class and size of the Vectors which underlie the models. One 
could probably glean this information by inspecting the first data 
element presented to the algorithm at initialization time. At 
this time there is no connection between the Lucene index-to-Vector 
transformation in utils and the Dirichlet job in core/, and no obvious 
way to introduce one given the dependencies.
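To illustrate the "inspect the first data element" idea: a job could derive prototypeSize from the cardinality of the first input record instead of requiring -s. This is a hypothetical sketch, with a comma-separated line standing in for the first Vector read from the input (Mahout actually reads SequenceFiles of Vectors):

```java
// Hedged sketch: deriving prototypeSize from the first record seen,
// rather than taking it as a required command-line argument. The
// comma-separated string is a stand-in for the first input Vector.
public class InferSize {
    // Infer the cardinality (prototypeSize) from the first record.
    static int inferPrototypeSize(String firstRecord) {
        return firstRecord.split(",").length;
    }

    public static void main(String[] args) {
        String first = "0.5,1.2,3.3,0.0"; // pretend first vector in the input
        System.out.println(inferPrototypeSize(first)); // 4 dimensions
    }
}
```

The catch, as mentioned above, is that the initialization code would need access to the input data before the models are created, which the current job structure does not provide.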

Code suggestions and patches to improve all of this are of course welcome,
