mahout-user mailing list archives

From Aleksander Stensby <aleksander.sten...@integrasco.com>
Subject Re: Some basic introductory questions
Date Thu, 17 Sep 2009 18:59:43 GMT
Thanks for all the replies, guys!
I understand the flow of things and it makes sense, but as Shawn pointed
out, there could still be more abstraction (and once I get my hands dirty
I'll do my best to contribute here as well :) ).

And to Mark: your proposed flow makes sense, but what I wanted was to do
all of that from a single entry point. (Ideally, I don't want any manual
steps here; I want everything to run on a regular basis from one entry
point, and by that I mean for any algorithm.) I can probably do that just
fine by using the Drivers.
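
Something like this is what I have in mind as a single entry point. It is
only a rough sketch: the exact runJob signature and the distance-measure
package have moved around between revisions, so the argument list below is
an assumption to be checked against KMeansDriver in your own checkout.

    import org.apache.mahout.clustering.kmeans.KMeansDriver;

    public class ClusteringJob {
      public static void main(String[] args) throws Exception {
        // Paths from Mark's example below; all of these live on the DFS.
        String input = "/dfs/input/data/dir";
        String clustersIn = "/dfs/initial/rand/centroids/dir";
        String output = "/dfs/output/dir";
        // Assumed argument order: input, initial clusters, output, distance
        // measure class, convergence delta, max iterations, reduce tasks.
        KMeansDriver.runJob(input, clustersIn, output,
            "org.apache.mahout.utils.EuclideanDistanceMeasure",
            0.5, 10, 1);
      }
    }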

Again, thanks for the replies!

Cheers,
 Aleks

On Thu, Sep 17, 2009 at 3:35 PM, Grant Ingersoll <gsingers@apache.org> wrote:

>
> On Sep 17, 2009, at 6:24 AM, Levy, Mark wrote:
>
>> Hi Aleksander,
>>
>> I've also been learning how to run mahout's clustering and LDA on our
>> cluster.
>>
>> For k-means, the following series of steps has worked for me:
>>
>> * build mahout from trunk
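
For reference, building from trunk looks roughly like this (the repository
URL is from memory, so verify it against the Mahout site):

    svn co http://svn.apache.org/repos/asf/lucene/mahout/trunk mahout
    cd mahout
    mvn install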
>>
>> * write a program to convert your data to mahout Vectors.  You can base
>> this on one of the Drivers in the mahout.utils.vectors package (which
>> seem designed to work locally).  For bigger datasets you'll probably
>> need to write a simple MapReduce job, more like
>> mahout.clustering.syntheticcontrol.canopy.InputDriver.  In either event
>> your Vectors need to end up on the dfs.
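
A minimal local sketch of that conversion step. The class names here follow
later Mahout releases (org.apache.mahout.math, with NamedVector and
VectorWritable); the 0.2-era trunk keeps its vector classes in
org.apache.mahout.matrix and puts the name field on the Vector itself, so
adjust the imports to match your checkout.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.NamedVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.math.VectorWritable;

    public class ToVectors {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Destination on the DFS; KMeansDriver reads this directory as -i.
        Path out = new Path("/dfs/input/data/dir/part-00000");

        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, out, Text.class, VectorWritable.class);
        try {
          // Placeholder data; a real job would read your own records.
          double[][] rows = { { 1.0, 2.0 }, { 3.0, 4.0 } };
          for (int i = 0; i < rows.length; i++) {
            // Name each point so ClusterDumper can report IDs later.
            Vector v = new NamedVector(new DenseVector(rows[i]), "doc-" + i);
            writer.append(new Text("doc-" + i), new VectorWritable(v));
          }
        } finally {
          writer.close();
        }
      }
    }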
>>
>
> Yeah, they are designed for local use so far, but we should work to extend
> them.  I think as Mahout matures, this problem will become less and less of
> an issue.  Ultimately, I'd like to see utilities that simply ingest whatever
> is up on HDFS (office docs, PDFs, mail, etc.) and just work, but that is a
> _long_ way off, unless someone wants to help drive it.
>
> Those kinds of utilities would be great contributions from someone looking
> to get started contributing.  As I see it, we could leverage Apache Tika
> with an M/R job to produce the appropriate input format for our various
> algorithms.
>
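
A purely hypothetical sketch of what the map step of such a job could look
like; none of this exists in Mahout yet, it only illustrates the shape of
the idea, with Tika's AutoDetectParser reducing each input document to
plain text keyed by file name:

    import java.io.ByteArrayInputStream;

    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;

    public class TikaTextMapper
        extends Mapper<Text, BytesWritable, Text, Text> {
      @Override
      protected void map(Text fileName, BytesWritable raw, Context ctx)
          throws java.io.IOException, InterruptedException {
        // -1 disables the default write limit on extracted text.
        BodyContentHandler handler = new BodyContentHandler(-1);
        try {
          new AutoDetectParser().parse(
              new ByteArrayInputStream(raw.getBytes(), 0, raw.getLength()),
              handler, new Metadata());
        } catch (Exception e) {
          return; // skip anything Tika cannot parse
        }
        // Downstream vectorization would consume this (fileName, text) pair.
        ctx.write(fileName, new Text(handler.toString()));
      }
    }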
>
>> * run clustering with org.apache.mahout.clustering.kmeans.KMeansDriver,
>> something like:
>>  hadoop jar mahout-core-0.2-SNAPSHOT.job
>> org.apache.mahout.clustering.kmeans.KMeansDriver -i /dfs/input/data/dir
>> -c /dfs/initial/rand/centroids/dir -o /dfs/output/dir -k <numClusters>
>> -x <maxIters>
>>
>> * possibly fix the problem described at
>> http://www.nabble.com/ClassNotFoundException-with-pseudo-distributed-run-of-KMeans-td24505889.html
>> (the solution is at the bottom of the page)
>>
>> * get all the output files locally
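
For example, with the standard Hadoop shell (the destination path here is
just illustrative):

    hadoop fs -get /dfs/output/dir ./kmeans-output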
>>
>> * convert the output to text format with
>> org.apache.mahout.utils.clustering.ClusterDumper.  It might be nicer to
>> do this on the cluster, but the code seems to expect local files.  If
>> you set the name field in your input Vectors in the conversion step to a
>> suitable ID, then the final output can be a set of cluster centroids,
>> each followed by the list of Vector IDs in the corresponding cluster.
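
A hypothetical invocation of that dump step; the flag names follow the
ClusterDumper options in later releases and the directory layout is
assumed, so verify both against your build:

    java -cp mahout-core-0.2-SNAPSHOT.job \
        org.apache.mahout.utils.clustering.ClusterDumper \
        --seqFileDir ./kmeans-output/clusters-<n> \
        --pointsDir ./kmeans-output/points \
        --output clusters.txt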
>>
>> Hope this is useful.
>>
>> More importantly, if anything here is very wrong, please can a Mahout
>> person correct me!
>>
>
> Looks good to me.  Suggestions/patches are welcome!
>
>


-- 
Aleksander M. Stensby
Lead Software Developer and System Architect
Integrasco A/S
E-mail: aleksander.stensby@integrasco.com
Tel.: +47 41 22 82 72
www.integrasco.com
http://twitter.com/Integrasco
http://facebook.com/Integrasco

Please consider the environment before printing all or any of this e-mail
