Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Mahout on Elastic MapReduce (https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+on+Elastic+MapReduce)

Edited by Grant Ingersoll:
---------------------------------------------------------------------
h1. Introduction

This page details the steps necessary to get an example of k-Means clustering running on Amazon's [Elastic MapReduce|http://aws.amazon.com/elasticmapreduce/] (EMR).

Note: Some of this work is due in part to credits donated by the Amazon Web Services Apache Projects Testing Program.

h1. Getting Started

* Get yourself an EMR account. If you're already using EC2, you can do this from [Amazon's AWS Management Console|https://console.aws.amazon.com/], which has a tab for running EMR.
* Get the [ElasticFox|https://addons.mozilla.org/en-US/firefox/addon/11626] and [S3Fox|https://addons.mozilla.org/en-US/firefox/search?q=s3fox&cat=all] Firefox extensions. These make it easy to monitor running EMR instances, upload code and data, and download results.
* Download the [Ruby command line client for EMR|http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2264&categoryID=262]. You can do things from the GUI, but when you're in the midst of trying to get something running, the CLI client will make life a lot easier.
* Have a look at [Common Problems Running Job Flows|http://developer.amazonwebservices.com/connect/thread.jspa?messageID=124694#124694] and [Developing and Debugging Job Flows|http://developer.amazonwebservices.com/connect/message.jspa?messageID=124695#124695] in the EMR forum at Amazon. They were tremendously useful.
* Make sure that you're up to date with the Mahout source. The fix for [Issue 118|http://issues.apache.org/jira/browse/MAHOUT-118] is required to get things running when you're sending output to an S3 bucket.
* Build the Mahout core and examples. Note that EMR runs Hadoop 0.20.0.

The EMR GUI in the AWS Management Console provides a number of examples of using EMR, and you might want to try running one of these to get started. One big gotcha that I discovered is that the S3N file system for Hadoop has a couple of weird cases that boil down to the following advice: when naming a directory in an s3n URI, make sure that it ends in a slash, and don't use a top-level S3 bucket name as the place where your Mahout output will be going; always include a subdirectory.
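To make those two rules concrete, here is a hypothetical illustration (the bucket and directory names are made up):

{noformat}
s3n://my-bucket               <- avoid: top-level bucket as an output location
s3n://my-bucket/output        <- risky: directory name without a trailing slash
s3n://my-bucket/output/       <- good: subdirectory with a trailing slash
{noformat}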
h1. Uploading Code and Data

I decided to use separate S3 buckets for the Mahout code, the input for the clustering (I used the synthetic control data, which you can find easily from the [Quickstart] page), and the output of the clustering. You will need to upload:

# The Mahout job jar. For the example here, we are using {{mahout-core-0.4-SNAPSHOT.job}}.
# The data. In this example, we uploaded two files: dictionary.txt and part-out.vec. The latter is the main vector file and the former is the dictionary that maps words to columns. It was created by converting a Lucene index to Mahout vectors.

h1. Running k-means Clustering

EMR offers two modes for running MapReduce jobs. The first is a "streaming" mode where you provide the source for single-step mapper and reducer functions (you can use languages other than Java for this). The second mode is called "Custom Jar", and it gives you full control over the job steps that will run. This is the mode we need to use to run Mahout.

In order to run in Custom Jar mode, you need to look at the example that you want to run and figure out the arguments to provide to the job. Essentially, you need to know the command line that you would give to bin/hadoop in order to run the job, including whatever parameters the job needs to run.
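For reference, here is a sketch of what that bin/hadoop command line looks like for the k-means job used in the rest of this page (the S3 paths are placeholders; substitute your own buckets and directories):

{noformat}
bin/hadoop jar mahout-core-0.4-SNAPSHOT.job \
  org.apache.mahout.clustering.kmeans.KMeansDriver \
  --input s3n://PATH/part-out.vec \
  --clusters s3n://PATH/kmeans/clusters/ \
  -k 10 \
  --output s3n://PATH/output/ \
  --distanceMeasure org.apache.mahout.common.distance.CosineDistanceMeasure \
  --convergenceDelta 0.001 \
  --overwrite --maxIter 50 --clustering
{noformat}

On EMR you don't invoke bin/hadoop yourself; instead, you hand the jar, the main class, and these arguments to EMR in one of the two ways described below.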
h2. Using the GUI

The EMR GUI is an easy way to start up a Custom Jar run, but it doesn't have the full functionality of the CLI. Basically, you tell the GUI where in S3 the jar file is, using a Hadoop s3n URI like {{s3n://PATH/mahout-core-0.4-SNAPSHOT.job}}. The GUI will check that the given file exists, which is a nice sanity check. You can then provide the arguments for the job just as you would on the command line. The arguments for the k-means job were as follows:

{noformat}
org.apache.mahout.clustering.kmeans.KMeansDriver --input s3n://news-vecs/part-out.vec --clusters s3n://news-vecs/kmeans/clusters-9-11/ -k 10 --output s3n://news-vecs/out-9-11/ --distanceMeasure org.apache.mahout.common.distance.CosineDistanceMeasure --convergenceDelta 0.001 --overwrite --maxIter 50 --clustering
{noformat}

TODO: Screenshot

The main failing of the GUI mode is that you can only specify a single job to run, and you can't run another job in the same set of instances. Recall that on AWS you pay for partial hours at the hourly rate, so if your job fails in the first 10 seconds, you pay for the full hour, and if you try again, you're going to pay for another hour. Because of this, using the command line interface (CLI) is strongly recommended.

h2. Using the CLI

If you're in development mode and trying things out, EMR allows you to set up a set of instances and leave them running. Once you've done this, you can add job steps to the set of instances as you like. This solves the "10 second failure" problem that I described above and lets you get full value for your EMR dollar.

Amazon has pretty good [documentation for the CLI|http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/index.html?CHAP_RunningaJob.html], which you'll need to read to figure out how to do things like set up your AWS credentials for the EMR CLI. You can start up a job flow that will keep running using an invocation like the following:

{noformat}
./elastic-mapreduce --create --alive \
  --log-uri s3n://PATH_FOR_LOGS/ --key-pair YOUR_KEY \
  --num-instances 2 --name NAME_HERE
{noformat}

Fill in the name, key pair, and path for logs as appropriate. This call returns the name of the job flow, which you'll need for subsequent calls to add steps to the job flow. You can, however, retrieve it at any time by calling:

{noformat}
./elastic-mapreduce --list
{noformat}

Let's list our job flows:

{noformat}
[stgreen@dhcp-ubur02-74-153 14:16:15 emr]$ ./elastic-mapreduce --list
j-3JB4UF7CQQ025   WAITING   ec2-174-129-90-97.compute-1.amazonaws.com   kmeans
{noformat}

At this point, everything's started up, and it's waiting for us to add a step to the job. When we started the job flow, we specified a key pair that we created earlier, so we can log into the master while the job flow is running:

{noformat}
elastic-mapreduce --ssh -j j-3JB4UF7CQQ025
{noformat}

Let's add a step to run a job:

{noformat}
elastic-mapreduce -j j-3JB4UF7CQQ025 \
  --jar s3n://PATH/mahout-core-0.4-SNAPSHOT.job \
  --main-class org.apache.mahout.clustering.kmeans.KMeansDriver \
  --arg --input --arg s3n://PATH/part-out.vec \
  --arg --clusters --arg s3n://PATH/kmeans/clusters/ \
  --arg -k --arg 10 \
  --arg --output --arg s3n://PATH/out-9-11/ \
  --arg --distanceMeasure --arg org.apache.mahout.common.distance.CosineDistanceMeasure \
  --arg --convergenceDelta --arg 0.001 \
  --arg --overwrite \
  --arg --maxIter --arg 50 \
  --arg --clustering
{noformat}

When you do this, the job flow goes into the {{RUNNING}} state for a while and then returns to {{WAITING}} once the step has finished. You can use the CLI or the GUI to monitor the step while it runs. Once you've finished with your job flow, you can shut it down the following way:

{noformat}
./elastic-mapreduce -j j-3JB4UF7CQQ025 --terminate
{noformat}

and go look in your S3 buckets to find your output and logs.

h1. Troubleshooting

The primary means of understanding what went wrong are the logs and stderr/stdout. When running on EMR, stderr and stdout are captured to files in your log directories. Additionally, logging is set up to write to a file called syslog. To view these in the AWS Console, go to your logs directory, then the folder with the same job flow id as above (j-3JB4UF7CQQ025), then the steps folder, and then the folder for the appropriate step number (usually 1 in this case). That is, go to the folder s3n://PATH_TO_LOGS/j-3JB4UF7CQQ025/steps/1. In this directory, you will find stdout, stderr, syslog, and potentially a few other logs.

See the [resulting thread|http://developer.amazonwebservices.com/connect/thread.jspa?threadID=30945&tstart=15] for some early user experience with Mahout on EMR.
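If you'd rather pull the logs down than browse them in the console, you can also read them with the Hadoop client from your own machine. This is a sketch, assuming your AWS credentials are configured for the s3n filesystem (the {{fs.s3n.awsAccessKeyId}} and {{fs.s3n.awsSecretAccessKey}} properties in your Hadoop configuration) and using the same placeholder log path as above:

{noformat}
# List the logs for step 1 of the job flow, then dump syslog to the terminal
bin/hadoop fs -ls s3n://PATH_TO_LOGS/j-3JB4UF7CQQ025/steps/1/
bin/hadoop fs -cat s3n://PATH_TO_LOGS/j-3JB4UF7CQQ025/steps/1/syslog
{noformat}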