mahout-user mailing list archives

From Grant Ingersoll <>
Subject Re: Clustering from DB
Date Thu, 02 Jul 2009 03:32:36 GMT

On Jul 1, 2009, at 9:37 AM, nfantone wrote:

> Ok, so I managed to write a VectorIterable implementation to draw data
> from my database. Now, I'm in the process of understanding the output
> file that kMeans (with a Canopy input) produces. Someone, please,
> correct me if I'm mistaken. At first, my thought was that there were
> as many "cluster-i" directories as clusters detected from the dataset
> by the algorithm(s), until I printed out the content of the
> "part-00000" file in them. It seems as though it stores a <Writable>
> cluster ID and then a <Writable> Cluster, each line. Are those all the
> actual clusters detected? If so, what's the reason behind the
> directory nomenclature and its consecutive enumeration?

I was wondering the same thing myself.  I believe it has to do with  
the number of iterations or reduce tasks, but I haven't looked closely  
at the code yet.  Maybe Jeff can jump in here.
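
If the numeric suffix does turn out to be the iteration count (an assumption here, not something confirmed from the code), then the highest-numbered "cluster-i" directory would hold the final clusters. A minimal sketch of picking it from a directory listing follows; the function name and the sample listing are hypothetical, and it works on plain names rather than going through any Hadoop or Mahout API:

```python
import re

# Assumption: directories are named "cluster-<n>" and <n> is the iteration number.
CLUSTER_DIR = re.compile(r"^cluster-(\d+)$")

def latest_cluster_dir(names):
    """Return the "cluster-<n>" name with the largest n, or None if there is none."""
    numbered = []
    for name in names:
        match = CLUSTER_DIR.match(name)
        if match:
            numbered.append((int(match.group(1)), name))
    if not numbered:
        return None
    # Compare numerically so that cluster-10 sorts after cluster-9.
    return max(numbered)[1]

# Hypothetical listing of a kMeans output directory.
print(latest_cluster_dir(["cluster-0", "cluster-9", "cluster-10", "points"]))
```

The numeric comparison matters because a plain string sort would rank "cluster-9" above "cluster-10".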

> Does every
> "part-00000", in different "cluster-i" directories, hold different
> clusters? And, what about the "points" directory? I can tell it
> follows a <VectorID, Value> register format. What's that value
> supposed to represent? The ID of the cluster it belongs to, perhaps?

I believe this is the case.
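
Assuming that reading is right, and each record in "points" pairs a vector ID with the ID of the cluster it was assigned to, the cluster memberships can be recovered by grouping on the value. A rough sketch, with made-up IDs standing in for records already read from the file (reading the actual file is not shown):

```python
from collections import defaultdict

def group_by_cluster(assignments):
    """Group vector IDs by the cluster ID each one is assigned to."""
    clusters = defaultdict(list)
    for vector_id, cluster_id in assignments:
        clusters[cluster_id].append(vector_id)
    return dict(clusters)

# Hypothetical (vector ID, cluster ID) records, not real Mahout output.
sample = [("v1", "C0"), ("v2", "C1"), ("v3", "C0")]
for cluster_id, members in group_by_cluster(sample).items():
    print(cluster_id, members)
```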

> There really ought to be documentation about this somewhere. I don't
> know if I need some kind of permission, but I'm offering myself to
> write it and upload it to the Mahout wiki or wherever it should be,
> once I have finished my project.


> Thanks in advance.
> On Fri, Jun 26, 2009 at 1:54 PM, Sean Owen<> wrote:
>> All of Mahout is generally Hadoop/HDFS based. Taste is a bit of an
>> exception since it has a core that is independent of Hadoop and can
>> use data from files, databases, etc. It also happens to have some
>> clustering logic. So you can use, say, TreeClusteringRecommender to
>> generate user clusters, based on data in a database. This isn't
>> Mahout's primary clustering support, but, if it fits what you need, at
>> least it is there.
>> On Fri, Jun 26, 2009 at 12:21 PM, nfantone<> wrote:
>>> Thanks for the fast response, Grant.
>>> I am aware of what you pointed out about Taste. I just mentioned it to
>>> make a reference to something similar to what I needed to
>>> implement/use, namely the "DataModel" interface.
>>> I'm going to try the solution you suggested and write an
>>> implementation of VectorIterable. Expect me to come back here for
>>> feedback.

Grant Ingersoll

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
