mahout-user mailing list archives

From Jeff Eastman <j...@windwardsolutions.com>
Subject Re: Mahout on Elastic MapReduce
Date Thu, 16 Apr 2009 15:10:05 GMT
Hi Stephen,

It looks to me like you are on the right track. The original kMeans code 
and job patterns were written over a year ago, probably against Hadoop
0.10 or 0.11, IIRC. They have made significant changes to the file
system in the interim and nobody - except you - has tried to run kMeans 
on EMR.

Your reasoning about using the incorrect file system method is sound,
and your fix seems like it should work. I don't expect the Hadoop
version differences to affect you, since kMeans has not been updated
recently to take advantage of Hadoop improvements.
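
For anyone following along, the difference boils down to something like
this (the output path here is hypothetical, and the actual call sites
in the kMeans code may differ slightly):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    Configuration conf = new Configuration();
    Path outPath = new Path("s3n://my-bucket/output");

    // Wrong on EMR: returns the cluster's default file system (HDFS),
    // which then rejects the s3n:// path with "Wrong FS".
    FileSystem defaultFs = FileSystem.get(conf);

    // Right: returns the file system the URI actually points to
    // (here, the NativeS3FileSystem).
    FileSystem s3Fs = FileSystem.get(outPath.toUri(), conf);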

It certainly seems like dfs.exists(outPath) should be false if outPath
has been deleted; see the sketch below. You have a sharp machete and
are making good progress
breaking a jungle trail to EMR. If you'd like to chat on the phone or 
Skype, please contact me directly (jeff at windwardsolutions dot com).
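
As a sketch, the guard I'd expect around that delete looks roughly like
this (hypothetical code; the actual kMeans Job may differ):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    Configuration conf = new Configuration();
    Path outPath = new Path("s3n://my-bucket/output");
    FileSystem dfs = FileSystem.get(outPath.toUri(), conf);

    // Only delete the output path if it is actually there; with a
    // freshly deleted path this branch should never be taken.
    if (dfs.exists(outPath)) {
      dfs.delete(outPath, true); // recursive delete
    }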

Jeff


Stephen Green wrote:
> A bit more progress.  I asked about this problem on Amazon's EMR 
> forums.  Here's the thread:
>
> http://developer.amazonwebservices.com/connect/thread.jspa?threadID=30945
>
> The answer from Amazon was:
>
>> This appears to be an issue with Mahout. This exception is fairly 
>> common and matches the pattern of "Wrong FS: s3n://*/, expected: 
>> hdfs://*:9000". This occurs when you try and use an S3N path with 
>> HDFS. Typically this occurs because the code asks for the wrong 
>> FileSystem.
>>
>> This could happen because a developer used the wrong static method on 
>> Hadoop's FileSystem class:
>>
>> http://hadoop.apache.org/core/docs/r0.18.3/api/org/apache/hadoop/fs/FileSystem.html
>>
>> If you call FileSystem.get(Configuration conf) you'll get an instance 
>> of the cluster's default file system, which in our case is HDFS. 
>> Instead, if you have a URI and want a reference to the FileSystem 
>> that URI points to, you should call the method FileSystem.get(URI 
>> uri, Configuration conf).
>>
>
> He offered a solution that involved using DistCp to copy data from S3 
> to HDFS and then back again, but since I have the Mahout source, I 
> decided to pursue things a bit further.  I went into the source and 
> modified the places where the filesystem is fetched to do the following:
>
>     FileSystem dfs = FileSystem.get(outPath.toUri(), conf);
>
> (there were three places where I changed it, but I expect more are 
> lying around.)  This is the idiom used by the CloudBurst example on EMR.
>
> Making this change fixes the exception that I was getting, but I'm now 
> getting a different exception:
>
> java.lang.NullPointerException
>         at org.apache.hadoop.fs.s3native.NativeS3FileSystem.delete(NativeS3FileSystem.java:310)
>         at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.runJob(Job.java:83)
>         at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:45)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
>         at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>         at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
>
> (The line numbers in kmeans.Job are weird because I added logging.)
>
> If the Hadoop on EMR is really 0.18.3, then the null pointer here is 
> the store in the NativeS3FileSystem.  But there's another problem:  I 
> deleted the output path before I started the run, so the existence 
> check should have returned false and dfs.delete never should have 
> been called.  I added a bit of logging to the KMeans job and here's what it 
> says about the output path:
>
> 2009-04-16 14:04:35,757 INFO org.apache.mahout.clustering.syntheticcontrol.kmeans.Job (main): dfs: class org.apache.hadoop.fs.s3native.NativeS3FileSystem
>
> So it got the right output file system type.
>
> 2009-04-16 14:04:35,758 INFO org.apache.mahout.clustering.syntheticcontrol.kmeans.Job (main): s3n://mahout-output/ exists: true
>
> Shouldn't dfs.exists(outPath) return false for a non-existent path?  
> And didn't the store have to exist (i.e., be non-null) for it to 
> figure this out?  I guess this really is starting to verge into base 
> Hadoop territory.
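>
> A minimal probe along these lines (same bucket as my run) would show 
> what the s3n file system reports for the bare bucket path:
>
>     import org.apache.hadoop.conf.Configuration;
>     import org.apache.hadoop.fs.FileSystem;
>     import org.apache.hadoop.fs.Path;
>
>     Configuration conf = new Configuration();
>     Path outPath = new Path("s3n://mahout-output/");
>     FileSystem dfs = FileSystem.get(outPath.toUri(), conf);
>
>     // The Job's logging reported "exists: true" for this path even
>     // after I had deleted it, so I'd expect the same result here.
>     System.out.println("exists: " + dfs.exists(outPath));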
>
> I'm rapidly getting to the point where I need to solve this one just 
> to prove to myself that I can get it to run!
>
> Steve

