mahout-user mailing list archives

From nfantone <>
Subject Re: Clustering from DB
Date Wed, 01 Jul 2009 13:37:38 GMT
Ok, so I managed to write a VectorIterable implementation to draw data
from my database. Now I'm trying to understand the output that kMeans
(with Canopy input) produces. Someone, please, correct me if I'm
mistaken. At first, I thought there were as many "cluster-i"
directories as clusters detected in the dataset by the algorithm(s),
until I printed out the content of the "part-00000" file in them. It
seems to store, on each line, a <Writable> cluster ID followed by a
<Writable> Cluster. Are those all the actual clusters detected? If so,
what's the reason behind the directory naming and its consecutive
numbering? Does every "part-00000", in different "cluster-i"
directories, hold different clusters? And what about the "points"
directory? I can tell it follows a <VectorID, Value> record format.
What's that value supposed to represent? The ID of the cluster the
vector belongs to, perhaps?
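
To make the layout in question concrete, here is a minimal sketch that
lists the "cluster-i" directories of an output folder in numeric order
and checks for the "points" directory. It uses only java.nio, not the
Mahout or Hadoop APIs, and it does not decode the "part-00000" files.
The class name, the method name and the default "output" path are
made up for illustration; the "cluster-<number>" naming is assumed
from what is described above.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of Mahout.
class ClusterOutputLayout {

    // Assumption: directories are named "cluster-<number>".
    static final String PREFIX = "cluster-";

    // Returns the names of the "cluster-<number>" subdirectories of
    // outputDir, sorted by their numeric suffix (so cluster-2 comes
    // before cluster-10).
    static List<String> clusterDirsInOrder(Path outputDir) throws IOException {
        List<String> names = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(outputDir)) {
            for (Path entry : entries) {
                String name = entry.getFileName().toString();
                if (Files.isDirectory(entry) && name.matches("cluster-\\d+")) {
                    names.add(name);
                }
            }
        }
        names.sort(Comparator.comparingInt(
                n -> Integer.parseInt(n.substring(PREFIX.length()))));
        return names;
    }

    public static void main(String[] args) throws IOException {
        // "output" is a placeholder; pass the real directory as an argument.
        Path outputDir = Paths.get(args.length > 0 ? args[0] : "output");
        for (String dir : clusterDirsInOrder(outputDir)) {
            System.out.println(dir + "/part-00000");
        }
        System.out.println("points/ present: "
                + Files.isDirectory(outputDir.resolve("points")));
    }
}
```

This only shows which directories exist and in what order; whether
each one holds a different set of clusters is exactly the open
question.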

There really ought to be documentation about this somewhere. I don't
know if I need some kind of permission, but I'm offering to write it
and upload it to the Mahout wiki, or wherever it should go, once I
have finished my project.

Thanks in advance.

On Fri, Jun 26, 2009 at 1:54 PM, Sean Owen<> wrote:
> All of Mahout is generally Hadoop/HDFS based. Taste is a bit of an
> exception since it has a core that is independent of Hadoop and can
> use data from files, databases, etc. It also happens to have some
> clustering logic. So you can use, say, TreeClusteringRecommender to
> generate user clusters, based on data in a database. This isn't
> Mahout's primary clustering support, but, if it fits what you need, at
> least it is there.
> On Fri, Jun 26, 2009 at 12:21 PM, nfantone<> wrote:
>> Thanks for the fast response, Grant.
>> I am aware of what you pointed out about Taste. I just mentioned it to
>> make a reference to something similar to what I needed to
>> implement/use, namely the "DataModel" interface.
>> I'm going to try the solution you suggested and write an
>> implementation of VectorIterable. Expect me to come back here for
>> feedback.
