From: Jeff Eastman
Date: Tue, 14 Apr 2009 09:51:05 -0700
To: mahout-user@lucene.apache.org
Subject: Re: Mahout on Elastic MapReduce
Hi Stephen,

You are out on the bleeding edge with EMR. I've been able to run the kmeans example directly on a small EC2 cluster that I started up myself (using the Hadoop src/contrib/ec2 scripts). I have not yet tried EMR (I just got an account yesterday), but I see that it requires you to have your data in S3 as opposed to HDFS.

The job first runs the InputDriver to copy the raw test data into Mahout's external Vector representation, after deleting any pre-existing output files. The two delete() snippets you show look pretty much equivalent to me: if there is no pre-existing output directory, the Mahout snippet simply won't attempt to delete it. I too am at a loss to explain what you are seeing. If you can post more results I can try to help you read the tea leaves...

Jeff

Stephen Green wrote:
> I told some folks here at work that I would give a talk on Mahout for
> our reading group and decided that I would use it as an opportunity to
> try Amazon's Elastic MapReduce (EMR).
>
> I downloaded and untarred Hadoop 0.18.3, which is the version that
> Amazon claims they have running, so that I could try things out here.
> I can start up Hadoop and successfully run KMeans clustering on the
> synthetic control data using the instructions on the wiki and the
> following command line:
>
> bin/hadoop jar \
>     ~/Projects/EC2/mahout-0.1/examples/target/mahout-examples-0.1.job \
>     org.apache.mahout.clustering.syntheticcontrol.kmeans.Job \
>     input/testdata output org.apache.mahout.utils.EuclideanDistanceMeasure \
>     80 55 0.5 10
>
> I realize there's a shorter invocation, but I'm trying to figure out
> what Amazon needs to run this, so I pulled the default arguments
> from the KMeans job.
>
> Now, on Amazon, you can specify a jar file that gets run with
> "bin/hadoop jar", and you also specify the arguments that will be used
> with that jar file.
>
> The trick is that the input and output data need to be in S3 buckets,
> and you need to specify the locations with S3 native URIs. I used the
> command-line interface to EMR to create a job like so:
>
> elastic-mapreduce -v --create --name KMeans --num-instances 1 \
>     --jar s3n://mahout-code/mahout-examples-0.1.job \
>     --main-class org.apache.mahout.clustering.syntheticcontrol.kmeans.Job \
>     --arg s3n://mahout-input/testdata \
>     --arg s3n://mahout-output \
>     --arg org.apache.mahout.utils.EuclideanDistanceMeasure \
>     --arg 80 --arg 55 --arg 0.5 --arg 10
>
> But this fails with the message "Steps completed with errors." It turns
> out you can have the EMR infrastructure dump the logs for the tasks,
> and looking at the stderr for step 1 I see:
>
> java.lang.IllegalArgumentException: Wrong FS: s3n://mahout-output,
> expected: hdfs://domU-12-31-39-00-ED-51.compute-1.internal:9000
>     at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:320)
>     at org.apache.hadoop.dfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:84)
>     at org.apache.hadoop.dfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:140)
>     at org.apache.hadoop.dfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:408)
>     at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:667)
>     at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.runJob(Job.java:77)
>     at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:44)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at
org.apache.hadoop.util.RunJar.main(RunJar.java:155)
>     at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>     at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
>
> If I replace the s3n URI for the output with just mahout-output, the
> code appears to run without incident (at least the log output looks
> like the log output from my local run). Unfortunately, the HDFS
> instance it's written into disappears in a puff of smoke when the job
> finishes running.
>
> Now, I am by no means a Hadoop expert, but it seems like if it can
> load the data from an s3n input URI, then it probably has the right
> classes in there to do that (in fact, it looks like the jets3t jar is
> in the .job file three times!), so it seems like the KMeans job from
> Mahout should be happy to use an s3n output URI, but I'm clearly
> misunderstanding something here.
>
> One of the EMR samples is a Java DNA sequence matching application
> (CloudBurst), which seems to work fine with an s3n URI for the output.
> The setup for its output looks like the following:
>
>     Path oPath = new Path(outpath);
>     FileOutputFormat.setOutputPath(conf, oPath);
>     System.err.println(" Removing old results");
>     FileSystem.get(conf).delete(oPath);
>
> where "conf" is of type org.apache.hadoop.mapred.JobConf. This is a
> bit different from what happens in the KMeans job:
>
>     Path outPath = new Path(output);
>     client.setConf(conf);
>     FileSystem dfs = FileSystem.get(conf);
>     if (dfs.exists(outPath))
>       dfs.delete(outPath, true);
>
> Trying to use the CloudBurst idiom in the KMeans job produced no joy.
> Any help would be greatly appreciated.
>
> Steve Green
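[Editor's note: a possible explanation of the "Wrong FS" error above, not from the thread. In Hadoop, FileSystem.get(conf) returns the cluster's *default* filesystem (here HDFS), and that filesystem's checkPath() rejects any path whose scheme doesn't match its own, which is exactly what the IllegalArgumentException reports for s3n://mahout-output. A commonly suggested remedy is to resolve the filesystem from the path itself, i.e. outPath.getFileSystem(conf), rather than FileSystem.get(conf), in both snippets. The sketch below is plain Java that mimics the scheme-resolution logic in spirit only; the class and method names (WrongFsDemo, schemeFor) are illustrative, not Hadoop's.]

```java
import java.net.URI;

// Mimics, in spirit, how Hadoop decides which filesystem should handle a
// path: the path's own URI scheme wins; a scheme-less path falls back to
// the default filesystem. FileSystem.get(conf) skips this per-path
// resolution and always returns the default filesystem, whose checkPath()
// then throws "Wrong FS" for an s3n:// path.
public class WrongFsDemo {

    // Scheme a path would resolve to: its own scheme if present,
    // otherwise the default filesystem's scheme.
    static String schemeFor(String path, String defaultFsUri) {
        String scheme = URI.create(path).getScheme();
        return scheme != null ? scheme : URI.create(defaultFsUri).getScheme();
    }

    public static void main(String[] args) {
        String defaultFs = "hdfs://domU-12-31-39-00-ED-51.compute-1.internal:9000";

        // The s3n path needs the S3-native filesystem, not HDFS:
        System.out.println(schemeFor("s3n://mahout-output", defaultFs)); // s3n
        // A bare path inherits the default scheme, which is why plain
        // "mahout-output" runs, writing into the job-local HDFS:
        System.out.println(schemeFor("mahout-output", defaultFs));       // hdfs
    }
}
```

If this reading is right, changing `FileSystem dfs = FileSystem.get(conf);` to `FileSystem dfs = outPath.getFileSystem(conf);` in the KMeans Job's delete logic (and wherever else the output path is touched) should let the same code work against both hdfs:// and s3n:// output URIs.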