Message-ID: <314571.61307.qm@web50302.mail.re2.yahoo.com>
References: <49E4BEF9.8050302@windwardsolutions.com> <57E9288D-DC7D-413F-BC7F-CC9BDA538752@sun.com> <600121FD-93C3-4D90-81A8-82F945BEAF82@sun.com>
Date: Tue, 14 Apr 2009 14:08:19 -0700 (PDT)
From: Otis Gospodnetic
Subject: Re: Mahout on Elastic MapReduce
To: mahout-user@lucene.apache.org
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii

Hadoop should be able to read directly from S3, I believe:
http://wiki.apache.org/hadoop/AmazonS3

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
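
For what it's worth, the s3n filesystem should only need AWS credentials in the
job configuration before it can resolve s3n:// paths. A minimal sketch against
the 0.18-era API (the bucket name and credential placeholders are hypothetical,
not taken from this thread):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class S3nProbe {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // s3n needs AWS credentials; these are the 0.18-era property names.
        conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY");
        conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY");

        Path input = new Path("s3n://some-bucket/");   // hypothetical bucket
        FileSystem fs = input.getFileSystem(conf);     // NativeS3FileSystem, not HDFS
        FileStatus[] entries = fs.listStatus(input);
        if (entries != null) {
          for (FileStatus status : entries) {
            System.out.println(status.getPath());
          }
        }
      }
    }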

----- Original Message ----
> From: Sean Owen
> To: mahout-user@lucene.apache.org
> Sent: Tuesday, April 14, 2009 4:19:51 PM
> Subject: Re: Mahout on Elastic MapReduce
>
> This is a fairly uninformed observation, but: the error seems to be
> from Hadoop. It seems to say that it understands hdfs: but not s3n:,
> and that makes sense to me. Do we expect Hadoop to understand how to
> read from S3? I would expect not. (Though you point to examples that
> seem to overcome this just fine?)
>
> When I have integrated code with data stored on S3, I have always had
> to write extra glue code to copy from S3 to a local file system, do
> the work, then copy back.
>
> On Tue, Apr 14, 2009 at 9:01 PM, Stephen Green wrote:
> >
> > On Apr 14, 2009, at 2:41 PM, Stephen Green wrote:
> >
> >>
> >> On Apr 14, 2009, at 12:51 PM, Jeff Eastman wrote:
> >>
> >>> Hi Stephen,
> >>>
> >>> You are out on the bleeding edge with EMR.
> >>
> >> Yeah, but the view is lovely from here!
> >>
> >>> I've been able to run the kmeans example directly on a small EC2
> >>> cluster that I started up myself (using the Hadoop src/contrib/ec2
> >>> scripts). I have not yet tried EMR (just got an account yesterday),
> >>> but I see that it requires you to have your data in S3 as opposed
> >>> to HDFS.
> >>>
> >>> The job first runs the InputDriver to copy the raw test data into
> >>> the Mahout Vector external representation after deleting any
> >>> pre-existing output files. It looks to me like the two delete()
> >>> snippets you show are pretty much equivalent. If you have no
> >>> pre-existing output directory, the Mahout snippet won't attempt to
> >>> delete it.
> >>
> >> I managed to figure that out :-) I'm pretty comfortable with the
> >> ideas behind MapReduce, but being confronted with my first Job is a
> >> bit more daunting than I expected.
> >>
> >>> I too am at a loss to explain what you are seeing. If you can post
> >>> more results I can try to help you read the tea leaves...
> >>
> >> I noticed that the CloudBurst job just deleted the directory without
> >> checking for existence, so I tried the same thing with Mahout:
> >>
> >> java.lang.IllegalArgumentException: Wrong FS: s3n://mahout-output,
> >> expected: hdfs://domU-12-31-38-00-6C-86.compute-1.internal:9000
> >>   at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:320)
> >>   at org.apache.hadoop.dfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:84)
> >>   at org.apache.hadoop.dfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:140)
> >>   at org.apache.hadoop.dfs.DistributedFileSystem.delete(DistributedFileSystem.java:210)
> >>   at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.runJob(Job.java:83)
> >>   at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:46)
> >>
> >> So no joy there.
> >>
> >> Should I see if I can isolate this as an s3n problem? I suppose I
> >> could try running the Hadoop job locally with it reading and writing
> >> the data from S3 and see if it suffers from the same problem. At
> >> least then I could debug inside Hadoop.
> >>
> >> Of course, I'm doing all this in Hadoop 0.18.3, and if it is an s3n
> >> problem it might have been fixed already. That doesn't help much
> >> running on EMR, I guess.
> >>
> >> I'm also going to start a run on EMR that does away with the whole
> >> exists/delete check and see if that works.
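
One plausible reading of the "Wrong FS" errors in this thread is that the
driver asks the cluster's default filesystem (HDFS) to handle an s3n:// path,
which FileSystem.checkPath() rejects. A small sketch of the pattern that avoids
it, resolving the filesystem from the path itself; this is a hypothetical
exists/delete snippet, not the actual Mahout code:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class DeleteOutputSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path output = new Path("s3n://some-bucket/output");   // hypothetical output location

        // Resolve the filesystem from the path: s3n paths get NativeS3FileSystem,
        // hdfs paths get the DistributedFileSystem.
        FileSystem outputFs = output.getFileSystem(conf);
        if (outputFs.exists(output)) {
          outputFs.delete(output, true);                       // recursive delete (0.18 API)
        }

        // By contrast, FileSystem.get(conf) returns the *default* filesystem (HDFS
        // on a cluster), and handing it an s3n:// path fails checkPath() with the
        // same "Wrong FS: ... expected: hdfs://..." IllegalArgumentException.
        // FileSystem.get(conf).delete(output, true);          // would throw on a cluster
      }
    }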
This allowed the jobs to > > progress, but they died the death a little later with the following > > exception (and a few more, I can send the whole log if you like): > > > > java.lang.IllegalArgumentException: Wrong FS: > > s3n://mahoutput/canopies/part-00000, expected: > > hdfs://domU-12-31-39-00-A5-44.compute-1.internal:9000 > > at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:320) > > at > > > org.apache.hadoop.dfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:84) > > at > > > org.apache.hadoop.dfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:140) > > at > > > org.apache.hadoop.dfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:408) > > at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:695) > > at > > org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1420) > > at > > org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1415) > > at > > > org.apache.mahout.clustering.canopy.ClusterMapper.configure(ClusterMapper.java:69) > > at > > org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58) > > at > > org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82) > > at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:33) > > at > > org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58) > > at > > org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82) > > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:223) > > at > > org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198) > > > > Looking at the exception message there, I would almost swear that it things > > the whole s3n path is the name of a FS that it doesn't know about, but that > > might just be a bad message. This message repeats a few times (retrying > > failed mappers, I guess?) and then the job fails. > > > > One thing that occurred to me: the mahout examples job has the hadoop > > 0.19.1 core jar in it. Could I be seeing some kind of version skew between > > the hadoop in the job file and the one on EMR? Although it worked fine with > > a local 0.18.3, so maybe not. > > > > I'm going to see if I can get the stock Mahout to run with s3n inputs and > > outputs tomorrow and I'll let you all know how that goes. > > > > Steve > > -- > > Stephen Green // Stephen.Green@sun.com > > Principal Investigator \\ http://blogs.sun.com/searchguy > > Aura Project // Voice: +1 781-442-0926 > > Sun Microsystems Labs \\ Fax: +1 781-442-1692 > > > > > > > >