mahout-user mailing list archives

From Paul Ingles <p...@oobaloo.co.uk>
Subject Re: Error with KMeans example in trunk (793689)
Date Tue, 14 Jul 2009 12:23:15 GMT
I noticed it was using 0.20.0 this morning and gave it a go. I think
it failed at the clustering phase with a NoClassDefFoundError for the
Gson classes, but I don't remember exactly.

I'm running from an earlier revision against 0.19 at the moment, but  
will try 0.20 again when it's finished and let you know how it goes.

Thanks again,
Paul

On 14 Jul 2009, at 12:58, Grant Ingersoll wrote:

> Try Hadoop 0.20.0, which is what trunk is now on.  I will update the  
> docs.
>
>
> On Jul 13, 2009, at 7:02 PM, Paul Ingles wrote:
>
>> Hi,
>>
>> I've been going over the kmeans stuff the last few days to try and  
>> understand how it works, and how I might extend it to work with the  
>> data I'm looking to process. It's taken me a while to get a basic
>> understanding of things, and I really appreciate having lists like
>> this around for support.
>>
>> I need to be able to label the vectors: each vector holds (for a  
>> document) a set of similarity scores across a number of attributes.  
>> I did some searching around payloads (after coming across the term  
>> in some comments) but couldn't see how I add a payload to the  
>> Vector. I then stumbled on MAHOUT-65
>> (https://issues.apache.org/jira/browse/MAHOUT-65) that mentions the
>> addition of the setName method to Vector. I've
>> tried building trunk, and although there were a few test failures  
>> for other (seemingly unrelated) examples I continued and managed to  
>> get the mahout-examples jar/job files built to give it a whirl.
>>
>> When I run the following:
>>
>> $ hadoop jar examples/target/mahout-examples-0.2-SNAPSHOT.job org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
>>
>> I see it run the "Preparing Input", "Running Canopy to get initial  
>> clusters", and then finally it starts "Running KMeans". But,  
>> shortly after it breaks with the following trace:
>>
>> ---snip---
>> Running KMeans
>> 09/07/13 23:49:34 INFO kmeans.KMeansDriver: Input: output/data Clusters In: output/canopies Out: output Distance: org.apache.mahout.utils.EuclideanDistanceMeasure
>> 09/07/13 23:49:34 INFO kmeans.KMeansDriver: convergence: 0.5 max Iterations: 10 num Reduce Tasks: 1 Input Vectors: org.apache.mahout.matrix.SparseVector
>> 09/07/13 23:49:34 INFO kmeans.KMeansDriver: Iteration 0
>> 09/07/13 23:49:34 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
>> 09/07/13 23:49:34 INFO mapred.FileInputFormat: Total input paths to process : 2
>> 09/07/13 23:49:34 INFO mapred.JobClient: Running job: job_200907132019_0040
>> 09/07/13 23:49:35 INFO mapred.JobClient:  map 0% reduce 0%
>> 09/07/13 23:49:42 INFO mapred.JobClient:  map 50% reduce 0%
>> 09/07/13 23:49:43 INFO mapred.JobClient:  map 100% reduce 0%
>> 09/07/13 23:49:49 INFO mapred.JobClient:  map 100% reduce 100%
>> 09/07/13 23:49:50 INFO mapred.JobClient: Job complete: job_200907132019_0040
>> 09/07/13 23:49:50 INFO mapred.JobClient: Counters: 16
>> 09/07/13 23:49:50 INFO mapred.JobClient:   File Systems
>> 09/07/13 23:49:50 INFO mapred.JobClient:     HDFS bytes read=465629
>> 09/07/13 23:49:50 INFO mapred.JobClient:     HDFS bytes written=5631
>> 09/07/13 23:49:50 INFO mapred.JobClient:     Local bytes read=7806
>> 09/07/13 23:49:50 INFO mapred.JobClient:     Local bytes written=15674
>> 09/07/13 23:49:50 INFO mapred.JobClient:   Job Counters
>> 09/07/13 23:49:50 INFO mapred.JobClient:     Launched reduce tasks=1
>> 09/07/13 23:49:50 INFO mapred.JobClient:     Launched map tasks=2
>> 09/07/13 23:49:50 INFO mapred.JobClient:     Data-local map tasks=2
>> 09/07/13 23:49:50 INFO mapred.JobClient:   Map-Reduce Framework
>> 09/07/13 23:49:50 INFO mapred.JobClient:     Reduce input groups=7
>> 09/07/13 23:49:50 INFO mapred.JobClient:     Combine output records=10
>> 09/07/13 23:49:50 INFO mapred.JobClient:     Map input records=600
>> 09/07/13 23:49:50 INFO mapred.JobClient:     Reduce output records=7
>> 09/07/13 23:49:50 INFO mapred.JobClient:     Map output bytes=465600
>> 09/07/13 23:49:50 INFO mapred.JobClient:     Map input bytes=448580
>> 09/07/13 23:49:50 INFO mapred.JobClient:     Combine input records=600
>> 09/07/13 23:49:50 INFO mapred.JobClient:     Map output records=600
>> 09/07/13 23:49:50 INFO mapred.JobClient:     Reduce input records=10
>> 09/07/13 23:49:50 WARN kmeans.KMeansDriver: java.io.IOException: Cannot open filename /user/paul/output/clusters-0/_logs
>> java.io.IOException: Cannot open filename /user/paul/output/clusters-0/_logs
>> 	at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1394)
>> 	at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1385)
>> 	at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:338)
>> 	at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:171)
>> 	at org.apache.hadoop.io.SequenceFile$Reader.openFile(SequenceFile.java:1437)
>> 	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1424)
>> 	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
>> 	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
>> 	at org.apache.mahout.clustering.kmeans.KMeansDriver.isConverged(KMeansDriver.java:304)
>> 	at org.apache.mahout.clustering.kmeans.KMeansDriver.runIteration(KMeansDriver.java:241)
>> 	at org.apache.mahout.clustering.kmeans.KMeansDriver.runJob(KMeansDriver.java:194)
>> 	at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.runJob(Job.java:100)
>> 	at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:56)
>> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>> 	at java.lang.reflect.Method.invoke(Method.java:597)
>> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
>> 	at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
>> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>> 	at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
>> ---snip---
>>
>> This is against revision 793689, running on my development Mac Pro  
>> (pseudo-distributed single node) with Hadoop 0.19.1.
>>
>> It's a bit late to be digging through what's going on, but I'll try
>> to take a look tomorrow. I'm really excited about giving kmeans a
>> whirl on the document processing I'm playing with. In the meantime,
>> I was wondering whether anyone else had seen the same, or knew a  
>> way to accomplish something similar with the released version (or  
>> point me to a past good revision perhaps?)
>>
>> Thanks again,
>> Paul
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
> using Solr/Lucene:
> http://www.lucidimagination.com/search
>
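
Reading the quoted trace, KMeansDriver.isConverged appears to be opening an entry named _logs under output/clusters-0 as a SequenceFile, and the open fails. One possible workaround, sketched below, is to skip any entry whose name starts with "_" or "." before handing it to a reader, on the assumption that the cluster data lives only in the reducer output files (part-*). The class and method names here are hypothetical and not part of Mahout; plain strings stand in for HDFS paths so the sketch runs without Hadoop on the classpath.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not Mahout code: decides which entries of a
// job output directory should be treated as data files.
public class ClusterOutputFilter {

    // Assumption: names starting with "_" (e.g. _logs) or "." (e.g. checksum
    // files) are bookkeeping entries, not cluster data.
    public static boolean isDataFile(String name) {
        return !name.startsWith("_") && !name.startsWith(".");
    }

    // Returns only the entries that look like data files, preserving order.
    public static List<String> dataFiles(List<String> names) {
        List<String> kept = new ArrayList<>();
        for (String name : names) {
            if (isDataFile(name)) {
                kept.add(name);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Illustrative directory listing; the real contents may differ.
        List<String> listing = Arrays.asList("_logs", "part-00000", ".part-00000.crc");
        System.out.println(dataFiles(listing)); // prints [part-00000]
    }
}
```

In a real job the same predicate could be wrapped in a Hadoop PathFilter and passed to FileSystem.listStatus. Whether trunk already does this is not stated in the thread.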

