mahout-dev mailing list archives

From "Grant Ingersoll (Commented) (JIRA)" <>
Subject [jira] [Commented] (MAHOUT-612) Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration
Date Tue, 08 Nov 2011 12:28:51 GMT


Grant Ingersoll commented on MAHOUT-612:

It seems like we shouldn't have to wait for the whole thing to be done on this.  Forward progress
towards where we want to go is better than no progress.
> Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration
> ----------------------------------------------------------------------------------------------
>                 Key: MAHOUT-612
>                 URL:
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification, Clustering, Collaborative Filtering
>    Affects Versions: 0.4, 0.5
>            Reporter: Frank Scholten
>            Assignee: Sean Owen
>             Fix For: Backlog
>         Attachments: MAHOUT-612-canopy.patch, MAHOUT-612-kmeans.patch, MAHOUT-612-v2.patch,
> Most of the Mahout features require running several jobs in sequence. This can be done
via the command line or using one of the driver classes.
> Running and configuring a Mahout job from Java requires either using the Driver's static
methods or creating a String array of parameters and passing it to the main method of the job.
If we can instead configure jobs through a Java bean or factory, the configuration will be type
safe and easier to use with DI frameworks such as Spring and Guice.
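To make the type-safety argument concrete, here is a minimal sketch of the two styles side by side. KMeansSettings, its defaults, and the "--maxIterations" flag are hypothetical names made up for illustration; they are not taken from the patch or from Mahout.

```java
// Hypothetical configuration bean for illustration only; not a class from Mahout.
final class KMeansSettings {
    private String input;
    private String output;
    private int maxIterations = 10;        // assumed default, not Mahout's
    private double convergenceDelta = 0.5; // assumed default, not Mahout's

    String getInput() { return input; }
    void setInput(String input) { this.input = input; }

    String getOutput() { return output; }
    void setOutput(String output) { this.output = output; }

    int getMaxIterations() { return maxIterations; }
    void setMaxIterations(int maxIterations) {
        // Invalid values are rejected when the bean is configured, not when the job runs.
        if (maxIterations < 1) {
            throw new IllegalArgumentException("maxIterations must be >= 1");
        }
        this.maxIterations = maxIterations;
    }

    double getConvergenceDelta() { return convergenceDelta; }
    void setConvergenceDelta(double convergenceDelta) { this.convergenceDelta = convergenceDelta; }
}

final class BeanConfigDemo {
    // String-array style: every value is text and is only checked when it is parsed.
    // The "--maxIterations" flag name is made up for this sketch.
    static int parseMaxIterations(String[] args) {
        for (int i = 0; i + 1 < args.length; i++) {
            if ("--maxIterations".equals(args[i])) {
                return Integer.parseInt(args[i + 1]); // NumberFormatException on bad input
            }
        }
        throw new IllegalArgumentException("missing --maxIterations");
    }

    public static void main(String[] args) {
        KMeansSettings settings = new KMeansSettings();
        settings.setInput("/data/points");
        settings.setOutput("/data/clusters");
        settings.setMaxIterations(20); // passing "twenty" here would not compile

        System.out.println(settings.getInput() + " -> " + settings.getOutput()
                + ", maxIterations=" + settings.getMaxIterations());
    }
}
```

With the bean, a wrong type is a compile error and a wrong value fails at the setter. With the String array, both surface only when the arguments are parsed. A bean with plain getters and setters is also the shape that Spring and Guice can populate directly.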
> In the attached patch I factored a KMeans MapReduce job and a configuration Java bean out
of KMeansDriver.buildClustersMR(...):
> * The KMeansMapReduceConfiguration takes care of setting up the correct values in the
Hadoop Configuration object and initializes defaults. I copied the config keys from KMeansConfigKeys.
> * The KMeansMapReduceJob contains the code for the actual algorithm. It runs all iterations
of KMeans and returns the KMeansMapReduceConfiguration, which contains the cluster path for
the final iteration.
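The split described in the two bullets above could look roughly like the following sketch. It is not the code from the patch: ClusteringConfig and ClusteringJob are hypothetical names, a plain Map stands in for the Hadoop Configuration object, the key names and the default are invented, the "clusters-N" directory naming is an assumption, and convergence is stubbed with a fixed iteration count.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical bean; the Map is a stand-in for Hadoop's Configuration object.
final class ClusteringConfig {
    static final String OUTPUT_KEY = "example.kmeans.output";                // made-up key
    static final String MAX_ITERATIONS_KEY = "example.kmeans.maxIterations"; // made-up key

    private final Map<String, String> conf = new HashMap<>();
    private String finalClusterPath;

    ClusteringConfig(String output) {
        conf.put(OUTPUT_KEY, output);
        conf.put(MAX_ITERATIONS_KEY, "10"); // assumed default, initialized by the bean
    }

    // Typed accessors hide the string keys from callers.
    void setMaxIterations(int n) { conf.put(MAX_ITERATIONS_KEY, Integer.toString(n)); }
    int getMaxIterations() { return Integer.parseInt(conf.get(MAX_ITERATIONS_KEY)); }
    String getOutput() { return conf.get(OUTPUT_KEY); }

    String getFinalClusterPath() { return finalClusterPath; }
    void setFinalClusterPath(String path) { this.finalClusterPath = path; }
}

// Hypothetical job: loops over iterations and hands back the configuration.
final class ClusteringJob {
    private final ClusteringConfig config;
    private final int iterationsUntilConverged; // stub for a real convergence check

    ClusteringJob(ClusteringConfig config, int iterationsUntilConverged) {
        this.config = config;
        this.iterationsUntilConverged = iterationsUntilConverged;
    }

    ClusteringConfig run() {
        String path = null;
        boolean converged = false;
        for (int i = 1; !converged && i <= config.getMaxIterations(); i++) {
            path = config.getOutput() + "/clusters-" + i; // assumed naming scheme
            converged = i >= iterationsUntilConverged;    // a real job would inspect its output
        }
        config.setFinalClusterPath(path);
        return config;
    }
}
```

Returning the configuration rather than a bare path keeps the input settings and the result together, which is what makes the later chaining idea straightforward.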
> I would like to extend this approach to other Hadoop jobs, for instance the job for creating
points in KMeansDriver, but I first want some feedback on this.
> One of the benefits of this approach is that it becomes easier to chain jobs. For instance
we can chain Canopy to KMeans by connecting the output dir of Canopy's configuration to the
input dir of the configuration of the KMeans job next in the chain. Hadoop's JobControl class
can then be used to connect and execute the entire chain.
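The wiring part of that chaining idea can be sketched without Hadoop. StepConfig and JobChain below are hypothetical names, and the paths are made up. The sketch only connects output directories to input directories; declaring dependencies and executing the jobs is the part JobControl would take over.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical per-job configuration bean; not a class from Mahout.
final class StepConfig {
    private final String name;
    private String inputDir;
    private final String outputDir;

    StepConfig(String name, String inputDir, String outputDir) {
        this.name = name;
        this.inputDir = inputDir;
        this.outputDir = outputDir;
    }

    String getName() { return name; }
    String getInputDir() { return inputDir; }
    void setInputDir(String inputDir) { this.inputDir = inputDir; }
    String getOutputDir() { return outputDir; }
}

// Stand-in for the wiring step only; it does not schedule or run anything.
final class JobChain {
    private final List<StepConfig> steps = new ArrayList<>();

    // Each added step reads from the output directory of the step before it.
    JobChain add(StepConfig next) {
        if (!steps.isEmpty()) {
            StepConfig previous = steps.get(steps.size() - 1);
            next.setInputDir(previous.getOutputDir());
        }
        steps.add(next);
        return this;
    }

    List<String> describe() {
        List<String> lines = new ArrayList<>();
        for (StepConfig step : steps) {
            lines.add(step.getName() + ": " + step.getInputDir() + " -> " + step.getOutputDir());
        }
        return lines;
    }
}

final class JobChainDemo {
    public static void main(String[] args) {
        StepConfig canopy = new StepConfig("canopy", "/data/points", "/work/canopy");
        StepConfig kmeans = new StepConfig("kmeans", null, "/work/kmeans");
        for (String line : new JobChain().add(canopy).add(kmeans).describe()) {
            System.out.println(line);
        }
    }
}
```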
> This approach can be further improved by turning the configuration bean into a factory
for creating MapReduce or sequential jobs. This would probably remove some duplicated code
in the KMeansDriver.
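One possible shape for such a factory, as a rough sketch only: ClusterJob, ExecutionMode, and KMeansJobFactory are hypothetical names, and the jobs merely describe themselves instead of running anything.

```java
// Hypothetical job abstraction; a real job would run the algorithm instead of describing it.
interface ClusterJob {
    String describe();
}

enum ExecutionMode { MAPREDUCE, SEQUENTIAL }

// Hypothetical factory: the settings are held once and shared by both job variants.
final class KMeansJobFactory {
    private final String input;
    private final int maxIterations;

    KMeansJobFactory(String input, int maxIterations) {
        this.input = input;
        this.maxIterations = maxIterations;
    }

    ClusterJob create(ExecutionMode mode) {
        switch (mode) {
            case MAPREDUCE:
                return () -> "mapreduce k-means on " + input + ", max " + maxIterations + " iterations";
            case SEQUENTIAL:
                return () -> "sequential k-means on " + input + ", max " + maxIterations + " iterations";
            default:
                throw new IllegalArgumentException("unknown mode: " + mode);
        }
    }
}
```

Because both variants are built from the same settings, parameter handling would only need to be written once, which is where the reduction in duplicated driver code would come from.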


