mahout-user mailing list archives

From Aleksander Stensby <>
Subject Re: Some basic introductory questions
Date Thu, 17 Sep 2009 18:59:43 GMT
Thanks for all the replies guys!
I understand the flow of things and it makes sense, but like Shawn pointed
out, there could still be more abstraction (and once I get my hands dirty
I'll do my best to contribute here as well :) )

And to Mark Levy: your proposed flow of things makes sense, but what I wanted
was to do all of that from one entry point. (Ideally, I don't want to do any
manual steps here; I want everything to run on a regular basis from a single
entry point, for any algorithm.) And I can probably do that just fine by
using the Drivers etc.
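For what it's worth, a single entry point along those lines might look like the
sketch below. This is only illustrative: the static `runJob` helper and its
parameter list follow 0.2-era trunk and may differ in your checkout, the paths
are placeholders, and the distance-measure class name shifted packages between
releases, so verify everything against your build before relying on it.

```java
// Hedged sketch: one main() that drives the whole k-means run.
// Signatures and package names are assumptions based on 0.2-era trunk;
// check org.apache.mahout.clustering.kmeans.KMeansDriver in your source tree.
public class ClusteringPipeline {
  public static void main(String[] args) throws Exception {
    String input = "/dfs/input/data/dir";                 // Vectors already on the DFS
    String initialCentroids = "/dfs/initial/rand/centroids/dir";
    String output = "/dfs/output/dir";

    // Run the iterative k-means job; in 0.2-era trunk this was a static helper.
    org.apache.mahout.clustering.kmeans.KMeansDriver.runJob(
        input, initialCentroids, output,
        "org.apache.mahout.utils.EuclideanDistanceMeasure", // hypothetical FQCN; verify
        0.001,  // convergence delta
        10,     // max iterations
        1);     // number of reduce tasks

    // A cron-driven wrapper could then fetch the output locally and run
    // ClusterDumper, since that tool appears to expect local files.
  }
}
```

The appeal of doing it this way rather than shelling out to `hadoop jar` is
that one scheduled JVM process can chain conversion, clustering, and dumping
without manual steps in between.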

Again, thanks for the replies!


On Thu, Sep 17, 2009 at 3:35 PM, Grant Ingersoll <> wrote:

> On Sep 17, 2009, at 6:24 AM, Levy, Mark wrote:
>  Hi Aleksander,
>> I've also been learning how to run mahout's clustering and LDA on our
>> cluster.
>> For k-means, the following series of steps has worked for me:
>> * build mahout from trunk
>> * write a program to convert your data to mahout Vectors.  You can base
>> this on one of the Drivers in the mahout.utils.vectors package (which
>> seem designed to work locally).  For bigger datasets you'll probably
>> need to  write a simple map reduce job, more like
>> mahout.clustering.syntheticcontrol.canopy.InputDriver.  In either event
>> your Vectors need to end up on the dfs.
> Yeah, they are designed for local use so far, but we should work to extend
> them.  I think as Mahout matures, this problem will become less and less of
> an issue.
>  Ultimately, I'd like to see utilities that simply ingest whatever is up on
> HDFS (office docs, PDFs, mail, etc.) and just works, but that is a _long_
> way off, unless someone wants to help drive that.
> Those kinds of utilities would be great contributions from someone looking
> to get started contributing.  As I see it, we could leverage Apache Tika
> with a M/R job to produce the appropriate kinds of things for our various
> algorithms.
>> * run clustering with org.apache.mahout.clustering.kmeans.KMeansDriver,
>> something like:
>>  hadoop jar mahout-core-0.2-SNAPSHOT.job
>> org.apache.mahout.clustering.kmeans.KMeansDriver -i /dfs/input/data/dir
>> -c /dfs/initial/rand/centroids/dir -o /dfs/output/dir -k <numClusters>
>> -x <maxIters>
>> * possibly fix the problem described here
>> -of-KMeans-td24505889.html (solution is at the bottom of the page)
>> * get all the output files locally
>> * convert the output to text format with
>> org.apache.mahout.utils.clustering.ClusterDumper.  It might be nicer to
>> do this on the cluster, but the code seems to expect local files.  If
>> you set the name field in your input Vectors in the conversion step to a
>> suitable ID, then the final output can be a set of cluster centroids,
>> each followed by the list of Vector IDs in the corresponding cluster.
>> Hope this is useful.
>> More importantly, if anything here is very wrong then please can a
>> mahout person correct me!
> Looks good to me.  Suggestions/patches are welcome!
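Pulling Mark's steps together, the whole run can be sketched as a short shell
sequence. The paths, cluster count, and iteration limit are placeholders, and
the exact ClusterDumper flags vary by build (check its usage output), so treat
this as a template rather than a recipe:

```shell
# 1. Convert your data to Mahout Vectors and put them on the DFS
#    (e.g. with your own map/reduce InputDriver, as described above).

# 2. Run k-means on the cluster; -k and -x values here are placeholders.
hadoop jar mahout-core-0.2-SNAPSHOT.job \
  org.apache.mahout.clustering.kmeans.KMeansDriver \
  -i /dfs/input/data/dir \
  -c /dfs/initial/rand/centroids/dir \
  -o /dfs/output/dir \
  -k 20 -x 10

# 3. Copy the output locally, since ClusterDumper seems to expect local files.
hadoop fs -get /dfs/output/dir ./kmeans-output

# 4. Dump clusters to text; flags omitted here because they differ by build:
#    java -cp <mahout jars> org.apache.mahout.utils.clustering.ClusterDumper ...
```

If the name field of each input Vector was set to a suitable ID in step 1, the
final text output is a set of centroids each followed by its members' IDs, as
Mark notes.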

Aleksander M. Stensby
Lead Software Developer and System Architect
Integrasco A/S
Tel.: +47 41 22 82 72

Please consider the environment before printing all or any of this e-mail
