From: Jeff Eastman
Date: Tue, 14 Apr 2009 09:51:05 -0700
To: mahout-user@lucene.apache.org
Subject: Re: Mahout on Elastic MapReduce
Hi Stephen,

You are out on the bleeding edge with EMR. I've been able to run the kmeans example directly on a small EC2 cluster that I started up myself (using the Hadoop src/contrib/ec2 scripts). I have not yet tried EMR (I just got an account yesterday), but I see that it requires you to have your data in S3 as opposed to HDFS.

The job first runs the InputDriver to copy the raw test data into Mahout's external Vector representation, after deleting any pre-existing output files. The two delete() snippets you show look pretty much equivalent to me: if there is no pre-existing output directory, the Mahout snippet simply won't attempt to delete it. I too am at a loss to explain what you are seeing. If you can post more results I can try to help you read the tea leaves...

Jeff

Stephen Green wrote:
> I told some folks here at work that I would give a talk on Mahout for
> our reading group and decided that I would use it as an opportunity to
> try Amazon's Elastic MapReduce (EMR).
>
> I downloaded and untarred Hadoop 0.18.3, which is the version that
> Amazon claims they have running, so that I could try things out here.
> I can start up Hadoop and successfully run KMeans clustering on the
> synthetic control data using the instructions on the wiki and the
> following command line:
>
> bin/hadoop jar \
>     ~/Projects/EC2/mahout-0.1/examples/target/mahout-examples-0.1.job \
>     org.apache.mahout.clustering.syntheticcontrol.kmeans.Job \
>     input/testdata output org.apache.mahout.utils.EuclideanDistanceMeasure \
>     80 55 0.5 10
>
> I realize there's a shorter invocation, but I'm trying to figure out
> what Amazon needs to run this, so I pulled the default arguments
> from the KMeans job.
>
> Now, on Amazon, you can specify a jar file that gets run with
> "bin/hadoop jar", and you also specify the arguments that will be used
> with that jar file.
>
> The trick is that the input and output data need to be in S3 buckets,
> and you need to specify the locations with S3 native URIs. I used the
> command-line interface to EMR to create a job like so:
>
> elastic-mapreduce -v --create --name KMeans --num-instances 1 \
>     --jar s3n://mahout-code/mahout-examples-0.1.job \
>     --main-class org.apache.mahout.clustering.syntheticcontrol.kmeans.Job \
>     --arg s3n://mahout-input/testdata \
>     --arg s3n://mahout-output \
>     --arg org.apache.mahout.utils.EuclideanDistanceMeasure \
>     --arg 80 --arg 55 --arg 0.5 --arg 10
>
> But this fails with the message "Steps completed with errors." It turns
> out you can have the EMR infrastructure dump the logs for the tasks,
> and looking at the stderr for step 1 I see:
>
> java.lang.IllegalArgumentException: Wrong FS: s3n://mahout-output,
> expected: hdfs://domU-12-31-39-00-ED-51.compute-1.internal:9000
>     at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:320)
>     at org.apache.hadoop.dfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:84)
>     at org.apache.hadoop.dfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:140)
>     at org.apache.hadoop.dfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:408)
>     at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:667)
>     at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.runJob(Job.java:77)
>     at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:44)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at
org.apache.hadoop.util.RunJar.main(RunJar.java:155)
>     at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>     at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
>
> If I replace the s3n URI for the output with just mahout-output, the
> code appears to run without incident (at least the log output looks
> like the log output from my local run). Unfortunately, the HDFS
> instance it's written into disappears in a puff of smoke when the job
> finishes running.
>
> Now, I am by no means a Hadoop expert, but it seems like if it can
> load the data from an s3n input URI, then it probably has the right
> classes in there to do that (in fact, it looks like the jets3t jar is
> in the .job file three times!), so it seems like the KMeans job from
> Mahout should be happy to use an s3n output URI, but I'm clearly
> misunderstanding something here.
>
> One of the EMR samples is a Java DNA sequence matching application
> (CloudBurst), which seems to work fine with an s3n URI for the output.
> The setup for its output looks like the following:
>
>     Path oPath = new Path(outpath);
>     FileOutputFormat.setOutputPath(conf, oPath);
>     System.err.println(" Removing old results");
>     FileSystem.get(conf).delete(oPath);
>
> where "conf" is of type org.apache.hadoop.mapred.JobConf. This is a
> bit different from what happens in the KMeans job:
>
>     Path outPath = new Path(output);
>     client.setConf(conf);
>     FileSystem dfs = FileSystem.get(conf);
>     if (dfs.exists(outPath))
>       dfs.delete(outPath, true);
>
> Trying to use the CloudBurst idiom in the KMeans job produced no joy.
> Any help would be greatly appreciated.
>
> Steve Green
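[Editor's note: a possible explanation of the "Wrong FS" error above, not from the thread. In Hadoop, FileSystem.get(conf) returns the cluster's *default* filesystem (here HDFS), and that filesystem's checkPath() rejects any path whose scheme doesn't match its own, which is exactly what the IllegalArgumentException reports for s3n://mahout-output. A commonly suggested remedy is to resolve the filesystem from the path itself, i.e. outPath.getFileSystem(conf), rather than FileSystem.get(conf), in both snippets. The sketch below is plain Java that mimics the scheme-resolution logic in spirit only; the class and method names (WrongFsDemo, schemeFor) are illustrative, not Hadoop's.]

```java
import java.net.URI;

// Mimics, in spirit, how Hadoop decides which filesystem should handle a
// path: the path's own URI scheme wins; a scheme-less path falls back to
// the default filesystem. FileSystem.get(conf) skips this per-path
// resolution and always returns the default filesystem, whose checkPath()
// then throws "Wrong FS" for an s3n:// path.
public class WrongFsDemo {

    // Scheme a path would resolve to: its own scheme if present,
    // otherwise the default filesystem's scheme.
    static String schemeFor(String path, String defaultFsUri) {
        String scheme = URI.create(path).getScheme();
        return scheme != null ? scheme : URI.create(defaultFsUri).getScheme();
    }

    public static void main(String[] args) {
        String defaultFs = "hdfs://domU-12-31-39-00-ED-51.compute-1.internal:9000";

        // The s3n path needs the S3-native filesystem, not HDFS:
        System.out.println(schemeFor("s3n://mahout-output", defaultFs)); // s3n
        // A bare path inherits the default scheme, which is why plain
        // "mahout-output" runs, writing into the job-local HDFS:
        System.out.println(schemeFor("mahout-output", defaultFs));       // hdfs
    }
}
```

If this reading is right, changing `FileSystem dfs = FileSystem.get(conf);` to `FileSystem dfs = outPath.getFileSystem(conf);` in the KMeans Job's delete logic (and wherever else the output path is touched) should let the same code work against both hdfs:// and s3n:// output URIs.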