mahout-user mailing list archives

From Jay Vyas <jayunit...@gmail.com>
Subject Re: Problem with K-Means clustering on Amazon EMR
Date Sun, 16 Mar 2014 15:41:06 GMT
I have fixed MapReduce jobs specifically by doing what the error message suggests.

But maybe (hopefully) there is another workaround that is configuration driven.

Just a hunch, but maybe Mahout needs to be refactored to create FileSystem objects using the
get(uri, conf) call?

As Hadoop evolves to support different flavors of HCFS, using the more flexible API calls
(i.e. FileSystem.get(uri, conf)) will probably be a good thing to keep
in mind.
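To make that concrete, here is a small self-contained sketch (illustrative only, not Mahout or Hadoop code; the class and method names are made up) of why the path check fails for get(conf) but not for get(uri, conf): Hadoop's FileSystem.checkPath rejects any path whose scheme differs from the filesystem's own URI, and get(conf) always hands back the fs.defaultFS filesystem (here hdfs), regardless of the path's scheme.

```java
import java.net.URI;

public class FsSchemeCheck {

    // Mimics the spirit of org.apache.hadoop.fs.FileSystem.checkPath:
    // a filesystem only accepts paths whose scheme matches its own URI.
    static void checkPath(URI fsUri, URI path) {
        String pathScheme = path.getScheme();
        if (pathScheme != null && !pathScheme.equals(fsUri.getScheme())) {
            throw new IllegalArgumentException(
                "This file system object (" + fsUri
                + ") does not support access to the request path '" + path + "'");
        }
    }

    public static void main(String[] args) {
        URI defaultFs = URI.create("hdfs://namenode:9000"); // what fs.defaultFS resolves to
        URI s3Path = URI.create("s3://bucket/folder");

        // FileSystem.get(conf) returns the default (hdfs) filesystem,
        // so checking an s3:// path against it throws:
        try {
            checkPath(defaultFs, s3Path);
        } catch (IllegalArgumentException e) {
            System.out.println("get(conf) analogue: " + e.getMessage());
        }

        // FileSystem.get(uri, conf) selects the filesystem by the path's
        // own scheme, so the schemes always match:
        checkPath(s3Path, s3Path);
        System.out.println("get(uri, conf) analogue: ok for " + s3Path);
    }
}
```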

> On Mar 16, 2014, at 9:22 AM, Frank Scholten <frank@frankscholten.nl> wrote:
> 
> Hi Konstantin,
> 
> Good to hear from you.
> 
> The link you mentioned points to EigenSeedGenerator not
> RandomSeedGenerator. The problem seems to be with the call to
> 
> fs.getFileStatus(input).isDir()
> 
> 
> It's been a while and I don't remember, but perhaps you have to set
> additional Hadoop fs properties to use S3. See
> https://wiki.apache.org/hadoop/AmazonS3. Perhaps you can isolate the cause of
> this by creating a small Java main app with that line of code and running it in
> the debugger.
> 
> Cheers,
> 
> Frank
> 
> 
> 
> On Sun, Mar 16, 2014 at 12:07 PM, Konstantin Slisenko
> <kslisenko@gmail.com>wrote:
> 
>> Hello!
>> 
>> I run a text-documents clustering on Hadoop cluster in Amazon Elastic Map
>> Reduce. As input and output I use S3 Amazon file system. I specify all
>> paths as "s3://bucket-name/folder-name".
>> 
>> SparseVectorsFromSequenceFile works correctly with S3,
>> but when I start the K-Means clustering job, I get this error:
>> 
>> Exception in thread "main" java.lang.IllegalArgumentException: This
>> file system object (hdfs://172.31.41.65:9000) does not support access
>> to the request path
>> 
>> 's3://by.kslisenko.bigdata/stackovweflow-small/out_new/sparse/tfidf-vectors'
>> You possibly called FileSystem.get(conf) when you should have called
>> FileSystem.get(uri, conf) to obtain a file system supporting your
>> path.
>> 
>>        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:375)
>>        at
>> org.apache.hadoop.hdfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:106)
>>        at
>> org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:162)
>>        at
>> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:530)
>>        at
>> org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:76)
>>        at
>> org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:93)
>>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>        at
>> bbuzz2011.stackoverflow.runner.RunnerWithInParams.cluster(RunnerWithInParams.java:121)
>>        at
>> bbuzz2011.stackoverflow.runner.RunnerWithInParams.run(RunnerWithInParams.java:52)
>>        at
>> bbuzz2011.stackoverflow.runner.RunnerWithInParams.main(RunnerWithInParams.java:41)
>>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>        at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>        at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>        at java.lang.reflect.Method.invoke(Method.java:597)
>>        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>> 
>> 
>> I checked RandomSeedGenerator.buildRandom
>> (
>> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.mahout/mahout-core/0.8/org/apache/mahout/clustering/kmeans/EigenSeedGenerator.java?av=f
>> )
>> and I assume the code there is correct:
>> 
>> FileSystem fs = FileSystem.get(output.toUri(), conf);
>> 
>> 
>> I cannot run clustering because of this error. Do you have any
>> ideas how to fix this?
>> 
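For what it's worth, the S3 wiki page Frank links mostly boils down to putting credentials in core-site.xml. A minimal sketch, assuming the Hadoop 1.x-era property names for the s3:// scheme (verify the names against your Hadoop/EMR version; the values are placeholders):

```xml
<!-- core-site.xml: illustrative S3 credential properties for the s3:// scheme.
     For s3n:// paths the corresponding fs.s3n.* properties apply. -->
<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>YOUR_SECRET_KEY</value>
</property>
```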
