From: Grant Ingersoll
To: mahout-user@lucene.apache.org
Subject: Re: Mahout on Elastic MapReduce
Date: Tue, 14 Apr 2009 17:17:23 -0400

I would be concerned about the fact that EMR is using 0.18 and Mahout
is on 0.19 (which of course raises another concern, expressed by Owen
O'Malley to me at ApacheCon: no one uses 0.19). I'd say you should try
reproducing the problem on the same Hadoop version that Mahout uses.

FWIW, any committer on the Mahout project can likely get credits to
use AWS.

On Apr 14, 2009, at 5:08 PM, Otis Gospodnetic wrote:

>
> Hadoop should be able to read directly from S3, I believe:
> http://wiki.apache.org/hadoop/AmazonS3
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> ----- Original Message ----
>> From: Sean Owen
>> To: mahout-user@lucene.apache.org
>> Sent: Tuesday, April 14, 2009 4:19:51 PM
>> Subject: Re: Mahout on Elastic MapReduce
>>
>> This is a fairly uninformed observation, but: the error seems to be
>> from Hadoop. It seems to say that it understands hdfs: but not s3n:,
>> and that makes sense to me. Do we expect Hadoop to understand how to
>> read from S3? I would expect not. (Though you point to examples that
>> seem to overcome this just fine?)
>>
>> When I have integrated code with stuff stored on S3, I have always
>> had to write extra glue code to copy from S3 to a local file system,
>> do the work, then copy back.
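For concreteness, the wiki page Otis links describes Hadoop reading S3
objects directly through its s3n: native filesystem, without the
copy-to-local glue Sean describes. What follows is a minimal sketch,
assuming the Hadoop 0.18 FileSystem API; the class name, bucket, key,
and credential values are hypothetical placeholders, though the
fs.s3n.* property names are the standard ones for the native S3
filesystem:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class S3nReadSketch {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // AWS credentials for the native S3 filesystem (placeholder values).
        conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY");
        conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY");

        // Let the s3n:// scheme select the S3 filesystem implementation,
        // rather than asking for the cluster's default (HDFS) filesystem.
        Path input = new Path("s3n://example-bucket/input/part-00000");
        FileSystem fs = FileSystem.get(input.toUri(), conf);

        // Stream the object's bytes to stdout as a smoke test.
        FSDataInputStream in = fs.open(input);
        try {
          byte[] buf = new byte[4096];
          int n;
          while ((n = in.read(buf)) > 0) {
            System.out.write(buf, 0, n);
          }
        } finally {
          in.close();
        }
      }
    }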
>> On Tue, Apr 14, 2009 at 9:01 PM, Stephen Green wrote:
>>>
>>> On Apr 14, 2009, at 2:41 PM, Stephen Green wrote:
>>>>
>>>> On Apr 14, 2009, at 12:51 PM, Jeff Eastman wrote:
>>>>>
>>>>> Hi Stephen,
>>>>>
>>>>> You are out on the bleeding edge with EMR.
>>>>
>>>> Yeah, but the view is lovely from here!
>>>>
>>>>> I've been able to run the kmeans example directly on a small EC2
>>>>> cluster that I started up myself (using the Hadoop src/contrib/ec2
>>>>> scripts). I have not yet tried EMR (just got an account yesterday),
>>>>> but I see that it requires you to have your data in S3 as opposed
>>>>> to HDFS.
>>>>>
>>>>> The job first runs the InputDriver to copy the raw test data into
>>>>> Mahout's external Vector representation, after deleting any
>>>>> pre-existing output files. It looks to me like the two delete()
>>>>> snippets you show are pretty much equivalent. If you have no
>>>>> pre-existing output directory, the Mahout snippet won't attempt to
>>>>> delete it.
>>>>
>>>> I managed to figure that out :-) I'm pretty comfortable with the
>>>> ideas behind MapReduce, but being confronted with my first Job is a
>>>> bit more daunting than I expected.
>>>>
>>>>> I too am at a loss to explain what you are seeing. If you can post
>>>>> more results, I can try to help you read the tea leaves...
>>>>
>>>> I noticed that the CloudBurst job just deleted the directory without
>>>> checking for existence, so I tried the same thing with Mahout:
>>>>
>>>> java.lang.IllegalArgumentException: Wrong FS: s3n://mahout-output, expected: hdfs://domU-12-31-38-00-6C-86.compute-1.internal:9000
>>>>     at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:320)
>>>>     at org.apache.hadoop.dfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:84)
>>>>     at org.apache.hadoop.dfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:140)
>>>>     at org.apache.hadoop.dfs.DistributedFileSystem.delete(DistributedFileSystem.java:210)
>>>>     at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.runJob(Job.java:83)
>>>>     at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:46)
>>>>
>>>> So no joy there.
>>>>
>>>> Should I see if I can isolate this as an s3n problem? I suppose I
>>>> could try running the Hadoop job locally with it reading and writing
>>>> the data from S3, and see if it suffers from the same problem. At
>>>> least then I could debug inside Hadoop.
>>>>
>>>> Of course, I'm doing all this in Hadoop 0.18.3, and if it is an s3n
>>>> problem it might have been fixed already. That doesn't help much
>>>> running on EMR, I guess.
>>>>
>>>> I'm also going to start a run on EMR that does away with the whole
>>>> exists/delete check and see if that works.
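As an aside on the trace above: "Wrong FS" is DistributedFileSystem's
way of saying it was handed a path belonging to a different filesystem.
The delete() appears to have been issued against the cluster's default
HDFS instance while the path is s3n://. A minimal sketch of the
scheme-aware pattern, assuming the Hadoop 0.18 API; the class, method,
and path here are illustrative, not the actual Mahout code:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class OutputCleanupSketch {
      public static void clearOutput(Configuration conf, String dir)
          throws IOException {
        Path output = new Path(dir);  // e.g. "s3n://mahout-output"

        // Problematic: FileSystem.get(conf) returns the cluster's
        // *default* filesystem (HDFS on EMR), and its delete() throws
        // IllegalArgumentException("Wrong FS: ...") for an s3n:// path.
        // FileSystem fs = FileSystem.get(conf);

        // Scheme-aware: resolve the filesystem from the path itself.
        FileSystem fs = output.getFileSystem(conf);
        if (fs.exists(output)) {
          fs.delete(output, true);  // 'true' = recursive delete
        }
      }
    }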
>>> Following up to myself (my wife will tell you that I talk to
>>> myself!): I removed a number of the exists/delete checks, in
>>> CanopyClusteringJob, CanopyDriver, KMeansDriver, and ClusterDriver.
>>> This allowed the jobs to progress, but they died the death a little
>>> later with the following exception (and a few more; I can send the
>>> whole log if you like):
>>>
>>> java.lang.IllegalArgumentException: Wrong FS: s3n://mahoutput/canopies/part-00000, expected: hdfs://domU-12-31-39-00-A5-44.compute-1.internal:9000
>>>     at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:320)
>>>     at org.apache.hadoop.dfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:84)
>>>     at org.apache.hadoop.dfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:140)
>>>     at org.apache.hadoop.dfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:408)
>>>     at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:695)
>>>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1420)
>>>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1415)
>>>     at org.apache.mahout.clustering.canopy.ClusterMapper.configure(ClusterMapper.java:69)
>>>     at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
>>>     at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
>>>     at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:33)
>>>     at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
>>>     at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:223)
>>>     at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)
>>>
>>> Looking at the exception message there, I would almost swear that it
>>> thinks the whole s3n path is the name of an FS that it doesn't know
>>> about, but that might just be a bad message. This message repeats a
>>> few times (retrying failed mappers, I guess?) and then the job fails.
>>>
>>> One thing that occurred to me: the Mahout examples job has the
>>> Hadoop 0.19.1 core jar in it. Could I be seeing some kind of version
>>> skew between the Hadoop in the job file and the one on EMR? Although
>>> it worked fine with a local 0.18.3, so maybe not.
>>>
>>> I'm going to see if I can get the stock Mahout to run with s3n
>>> inputs and outputs tomorrow, and I'll let you all know how that
>>> goes.
>>>
>>> Steve
>>> --
>>> Stephen Green          // Stephen.Green@sun.com
>>> Principal Investigator \\ http://blogs.sun.com/searchguy
>>> Aura Project           // Voice: +1 781-442-0926
>>> Sun Microsystems Labs  \\ Fax: +1 781-442-1692

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search
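The second trace shows the same pattern one level down: per the stack,
ClusterMapper.configure() constructs a SequenceFile.Reader over the
canopies path, and the reader is evidently given the default (HDFS)
filesystem while the path is s3n://. A sketch of the scheme-aware
variant, under the same Hadoop 0.18 API assumptions; the path is the
one from the trace, and the surrounding class is illustrative rather
than the actual Mahout code:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;

    public class CanopyReadSketch {
      public static void readCanopies(Configuration conf) throws IOException {
        Path canopies = new Path("s3n://mahoutput/canopies/part-00000");

        // Resolve the filesystem from the path's scheme instead of
        // assuming the cluster's default HDFS instance.
        FileSystem fs = canopies.getFileSystem(conf);
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, canopies, conf);
        try {
          // ... iterate over the canopy records with reader.next(key, value) ...
        } finally {
          reader.close();
        }
      }
    }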