mahout-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Problems with KMeans clustering
Date Tue, 28 Oct 2008 01:49:42 GMT
OK, I can confirm that the exact same code works with 0.17.2 and not  
w/ 0.18.1.  So, it sounds like a bug in Hadoop, or we are relying on  
incorrect behavior in Hadoop.
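
One reading of "relying on incorrect behavior" that fits the quoted trace: KMeansCombiner.reduce is being called from ReduceCopier.combineAndSpill, that is, during the reduce-side in-memory merge. If 0.18 applies the combiner to data that has already been combined (an assumption, not verified in this thread), then a combiner whose output format differs from its input format would fail on the second pass with exactly this kind of parse error. The sketch below is a minimal, self-contained illustration of that failure mode. The class name, method names and string formats are all hypothetical and are not Mahout's or Hadoop's actual code or encodings.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; the names and formats here do not come from Mahout or Hadoop. */
public class CombinerReapplySketch {

    /** Expects plain decimal strings as input but emits a bracketed "[count, sum]" string. */
    static String asymmetricCombine(List<String> values) {
        double sum = 0.0;
        for (String v : values) {
            // Throws NumberFormatException if v is a previous output such as "[2, 3.0]".
            sum += Double.parseDouble(v);
        }
        return "[" + values.size() + ", " + sum + "]";
    }

    /** Input and output share the same "count,sum" format, so the result can be combined again. */
    static String symmetricCombine(List<String> values) {
        long count = 0;
        double sum = 0.0;
        for (String v : values) {
            String[] parts = v.split(",");
            count += Long.parseLong(parts[0]);
            sum += Double.parseDouble(parts[1]);
        }
        return count + "," + sum;
    }

    public static void main(String[] args) {
        // First pass works; feeding its output back in fails to parse.
        String once = asymmetricCombine(Arrays.asList("1.0", "2.0"));
        try {
            asymmetricCombine(Arrays.asList(once, "3.0"));
        } catch (NumberFormatException e) {
            System.out.println("second pass failed: " + e.getMessage());
        }

        // The symmetric version tolerates being applied any number of times.
        String first = symmetricCombine(Arrays.asList("1,1.0", "1,2.0"));
        System.out.println(symmetricCombine(Arrays.asList(first, "1,3.0")));
    }
}
```

If this is what is happening, the fix would belong on the Mahout side: make the combiner emit the same format it consumes, so it does not matter how many times the framework chooses to run it.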


On Oct 27, 2008, at 9:33 PM, Grant Ingersoll wrote:

>
> On Oct 26, 2008, at 10:46 AM, Philippe Lamarche wrote:
>
>> Unfortunately, I went straight from 0.17.2 to 0.18.1.  It was working on
>> 0.17.2.
>>
>
> BTW, are you saying the exact same code was working on 0.17.2, or are
> you referring to some older Mahout code that worked on 0.17.2?
>
>
>>
>>
>> On Sun, Oct 26, 2008 at 9:48 AM, Grant Ingersoll <gsingers@apache.org> wrote:
>>
>>> Did this work with 0.18.0 or other prior versions for you?
>>>
>>>
>>>
>>> On Oct 25, 2008, at 7:23 PM, Philippe Lamarche wrote:
>>>
>>>> Hi,
>>>>
>>>> I just updated to Hadoop 0.18.1 and got a clean version of Mahout from
>>>> svn. However, I am having problems with KMeans that can be traced down to:
>>>>
>>>> 2008-10-25 19:10:16,987 INFO org.apache.hadoop.mapred.Merger: Merging 2 sorted segments
>>>> 2008-10-25 19:10:16,987 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 2 segments left of total size: 5011 bytes
>>>> 2008-10-25 19:10:16,999 WARN org.apache.hadoop.mapred.ReduceTask: attempt_200810251826_0013_r_000000_0 Merge of the inmemory files threw an exception: java.io.IOException: Intermedate merge failed
>>>>      at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2147)
>>>>      at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.run(ReduceTask.java:2078)
>>>> Caused by: java.lang.NumberFormatException: For input string: "["
>>>>      at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1224)
>>>>      at java.lang.Double.parseDouble(Double.java:510)
>>>>      at org.apache.mahout.matrix.DenseVector.decodeFormat(DenseVector.java:60)
>>>>      at org.apache.mahout.matrix.AbstractVector.decodeVector(AbstractVector.java:256)
>>>>      at org.apache.mahout.clustering.kmeans.KMeansCombiner.reduce(KMeansCombiner.java:38)
>>>>      at org.apache.mahout.clustering.kmeans.KMeansCombiner.reduce(KMeansCombiner.java:31)
>>>>      at org.apache.hadoop.mapred.ReduceTask$ReduceCopier.combineAndSpill(ReduceTask.java:2174)
>>>>      at org.apache.hadoop.mapred.ReduceTask$ReduceCopier.access$3100(ReduceTask.java:341)
>>>>      at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2134)
>>>>      ... 1 more
>>>>
>>>> 2008-10-25 19:10:16,999 INFO org.apache.hadoop.mapred.ReduceTask: In-memory merge complete: 0 files left.
>>>> 2008-10-25 19:10:17,000 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
>>>> java.io.IOException: attempt_200810251826_0013_r_000000_0The reduce copier failed
>>>>      at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:255)
>>>>      at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)
>>>>
>>>>
>>>> This is while running the synthetic_control.data example, but I have the
>>>> same problem with any other input data.
>>>>
>>>> I am able to run other map-reduce jobs without problems.
>>>>
>>>> Here is the output of the jar task:
>>>>
>>>> hadoop@philippe-vaio:/usr/local/hadoop$ bin/hadoop jar /home/philippe/workspace/MahoutJava/examples/dist/apache-mahout-examples-0.1-dev.jar org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
>>>> 08/10/25 19:09:27 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
>>>> 08/10/25 19:09:28 INFO mapred.FileInputFormat: Total input paths to process : 1
>>>> 08/10/25 19:09:28 INFO mapred.FileInputFormat: Total input paths to process : 1
>>>> 08/10/25 19:09:28 INFO mapred.JobClient: Running job: job_200810251826_0010
>>>> 08/10/25 19:09:29 INFO mapred.JobClient:  map 0% reduce 0%
>>>> 08/10/25 19:09:31 INFO mapred.JobClient:  map 50% reduce 0%
>>>> 08/10/25 19:09:32 INFO mapred.JobClient: Job complete: job_200810251826_0010
>>>> 08/10/25 19:09:32 INFO mapred.JobClient: Counters: 7
>>>> 08/10/25 19:09:32 INFO mapred.JobClient:   File Systems
>>>> 08/10/25 19:09:32 INFO mapred.JobClient:     HDFS bytes read=291644
>>>> 08/10/25 19:09:32 INFO mapred.JobClient:     HDFS bytes written=323660
>>>> 08/10/25 19:09:32 INFO mapred.JobClient:   Job Counters
>>>> 08/10/25 19:09:32 INFO mapred.JobClient:     Launched map tasks=2
>>>> 08/10/25 19:09:32 INFO mapred.JobClient:     Data-local map tasks=2
>>>> 08/10/25 19:09:32 INFO mapred.JobClient:   Map-Reduce Framework
>>>> 08/10/25 19:09:32 INFO mapred.JobClient:     Map input records=600
>>>> 08/10/25 19:09:32 INFO mapred.JobClient:     Map input bytes=288374
>>>> 08/10/25 19:09:32 INFO mapred.JobClient:     Map output records=600
>>>> 08/10/25 19:09:32 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
>>>> 08/10/25 19:09:32 INFO mapred.FileInputFormat: Total input paths to process : 2
>>>> 08/10/25 19:09:32 INFO mapred.FileInputFormat: Total input paths to process : 2
>>>> 08/10/25 19:09:32 INFO mapred.JobClient: Running job: job_200810251826_0011
>>>> 08/10/25 19:09:33 INFO mapred.JobClient:  map 0% reduce 0%
>>>> 08/10/25 19:09:37 INFO mapred.JobClient:  map 50% reduce 0%
>>>> 08/10/25 19:09:39 INFO mapred.JobClient:  map 100% reduce 0%
>>>> 08/10/25 19:09:44 INFO mapred.JobClient:  map 100% reduce 16%
>>>> 08/10/25 19:09:52 INFO mapred.JobClient: Job complete: job_200810251826_0011
>>>> 08/10/25 19:09:52 INFO mapred.JobClient: Counters: 16
>>>> 08/10/25 19:09:52 INFO mapred.JobClient:   File Systems
>>>> 08/10/25 19:09:52 INFO mapred.JobClient:     HDFS bytes read=323660
>>>> 08/10/25 19:09:52 INFO mapred.JobClient:     HDFS bytes written=1447
>>>> 08/10/25 19:09:52 INFO mapred.JobClient:     Local bytes read=1389
>>>> 08/10/25 19:09:52 INFO mapred.JobClient:     Local bytes written=37878
>>>> 08/10/25 19:09:52 INFO mapred.JobClient:   Job Counters
>>>> 08/10/25 19:09:52 INFO mapred.JobClient:     Launched reduce tasks=1
>>>> 08/10/25 19:09:52 INFO mapred.JobClient:     Launched map tasks=2
>>>> 08/10/25 19:09:52 INFO mapred.JobClient:     Data-local map tasks=2
>>>> 08/10/25 19:09:52 INFO mapred.JobClient:   Map-Reduce Framework
>>>> 08/10/25 19:09:52 INFO mapred.JobClient:     Reduce input groups=1
>>>> 08/10/25 19:09:52 INFO mapred.JobClient:     Combine output records=29
>>>> 08/10/25 19:09:52 INFO mapred.JobClient:     Map input records=600
>>>> 08/10/25 19:09:52 INFO mapred.JobClient:     Reduce output records=1
>>>> 08/10/25 19:09:52 INFO mapred.JobClient:     Map output bytes=943020
>>>> 08/10/25 19:09:52 INFO mapred.JobClient:     Map input bytes=323660
>>>> 08/10/25 19:09:52 INFO mapred.JobClient:     Combine input records=1760
>>>> 08/10/25 19:09:52 INFO mapred.JobClient:     Map output records=1732
>>>> 08/10/25 19:09:52 INFO mapred.JobClient:     Reduce input records=1
>>>> 08/10/25 19:09:53 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
>>>> 08/10/25 19:09:53 INFO mapred.FileInputFormat: Total input paths to process : 2
>>>> 08/10/25 19:09:53 INFO mapred.FileInputFormat: Total input paths to process : 2
>>>> 08/10/25 19:09:53 INFO mapred.JobClient: Running job: job_200810251826_0012
>>>> 08/10/25 19:09:54 INFO mapred.JobClient:  map 0% reduce 0%
>>>> 08/10/25 19:09:56 INFO mapred.JobClient:  map 50% reduce 0%
>>>> 08/10/25 19:09:58 INFO mapred.JobClient:  map 100% reduce 0%
>>>> 08/10/25 19:10:02 INFO mapred.JobClient: Job complete: job_200810251826_0012
>>>> 08/10/25 19:10:02 INFO mapred.JobClient: Counters: 16
>>>> 08/10/25 19:10:02 INFO mapred.JobClient:   File Systems
>>>> 08/10/25 19:10:02 INFO mapred.JobClient:     HDFS bytes read=326554
>>>> 08/10/25 19:10:02 INFO mapred.JobClient:     HDFS bytes written=1137260
>>>> 08/10/25 19:10:02 INFO mapred.JobClient:     Local bytes read=1147358
>>>> 08/10/25 19:10:02 INFO mapred.JobClient:     Local bytes written=2304490
>>>> 08/10/25 19:10:02 INFO mapred.JobClient:   Job Counters
>>>> 08/10/25 19:10:02 INFO mapred.JobClient:     Launched reduce tasks=1
>>>> 08/10/25 19:10:02 INFO mapred.JobClient:     Launched map tasks=2
>>>> 08/10/25 19:10:02 INFO mapred.JobClient:     Data-local map tasks=2
>>>> 08/10/25 19:10:02 INFO mapred.JobClient:   Map-Reduce Framework
>>>> 08/10/25 19:10:02 INFO mapred.JobClient:     Reduce input groups=1
>>>> 08/10/25 19:10:02 INFO mapred.JobClient:     Combine output records=0
>>>> 08/10/25 19:10:02 INFO mapred.JobClient:     Map input records=600
>>>> 08/10/25 19:10:02 INFO mapred.JobClient:     Reduce output records=600
>>>> 08/10/25 19:10:02 INFO mapred.JobClient:     Map output bytes=1139660
>>>> 08/10/25 19:10:02 INFO mapred.JobClient:     Map input bytes=323660
>>>> 08/10/25 19:10:02 INFO mapred.JobClient:     Combine input records=0
>>>> 08/10/25 19:10:02 INFO mapred.JobClient:     Map output records=600
>>>> 08/10/25 19:10:02 INFO mapred.JobClient:     Reduce input records=600
>>>> 08/10/25 19:10:02 INFO kmeans.KMeansDriver: Iteration 0
>>>> 08/10/25 19:10:02 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
>>>> 08/10/25 19:10:02 INFO mapred.FileInputFormat: Total input paths to process : 2
>>>> 08/10/25 19:10:02 INFO mapred.FileInputFormat: Total input paths to process : 2
>>>> 08/10/25 19:10:03 INFO mapred.JobClient: Running job: job_200810251826_0013
>>>> 08/10/25 19:10:04 INFO mapred.JobClient:  map 0% reduce 0%
>>>> 08/10/25 19:10:08 INFO mapred.JobClient:  map 50% reduce 0%
>>>> 08/10/25 19:10:09 INFO mapred.JobClient:  map 100% reduce 0%
>>>> 08/10/25 19:10:21 INFO mapred.JobClient: Task Id : attempt_200810251826_0013_r_000000_0, Status : FAILED
>>>> java.io.IOException: attempt_200810251826_0013_r_000000_0The reduce copier failed
>>>>      at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:255)
>>>>      at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)
>>>>
>>>>
>>>> I am not sure if I am doing something wrong here.
>>>>
>>>> Thanks for the help,
>>>>
>>>> Philippe.
>>>>
>>>
>>> --------------------------
>>> Grant Ingersoll
>>> Lucene Boot Camp Training Nov. 3-4, 2008, ApacheCon US New Orleans.
>>> http://www.lucenebootcamp.com
>>>
>>>
>>> Lucene Helpful Hints:
>>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>>> http://wiki.apache.org/lucene-java/LuceneFAQ
