mahout-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Suneel Marthi (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAHOUT-1469) Streaming KMeans fails when executed in MapReduce mode and REDUCE_STREAMING_KMEANS is set to true
Date Sun, 27 Apr 2014 08:32:17 GMT

    [ https://issues.apache.org/jira/browse/MAHOUT-1469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13982251#comment-13982251
] 

Suneel Marthi commented on MAHOUT-1469:
---------------------------------------

This definitely needs fixing.  Present Streaming KMeans impl is just not functional otherwise
as has been reported by few users over the last several months.

 Issue (1) is not an valid as per the discussion in this thread.
 Issue (3) is not valid as it doesn't make sense having -rskm flag when executed in sequential
mode, but need more adequate test coverage for the sequential execution.

 Issue 4 is a corner case that was never accounted for in the impl and needs fixing. Issue
2, I am not sure. 

There's one other issue about not updating estimatedDistanceCutoff in clusterInternal() that
Maxim had observed and I had long noticed (as did others on user@ before) is a choking point
during execution.


> Streaming KMeans fails when executed in MapReduce mode and REDUCE_STREAMING_KMEANS is
set to true
> -------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1469
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1469
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.9
>            Reporter: Suneel Marthi
>            Assignee: Suneel Marthi
>             Fix For: 1.0
>
>
> Centroids are not being generated when executed in MR mode with -rskm flag set. 
> {Code}
> 14/03/20 02:42:12 INFO mapreduce.StreamingKMeansThread: Estimated Points: 282
> 14/03/20 02:42:12 INFO mapred.JobClient:  map 100% reduce 0%
> 14/03/20 02:42:14 INFO mapreduce.StreamingKMeansReducer: Number of Centroids: 0
> 14/03/20 02:42:14 WARN mapred.LocalJobRunner: job_local1374896815_0001
> java.lang.IllegalArgumentException: Must have nonzero number of training and test vectors.
Asked for %.1f %% of %d vectors for test [10.000000149011612, 0]
> 	at com.google.common.base.Preconditions.checkArgument(Preconditions.java:148)
> 	at org.apache.mahout.clustering.streaming.cluster.BallKMeans.splitTrainTest(BallKMeans.java:176)
> 	at org.apache.mahout.clustering.streaming.cluster.BallKMeans.cluster(BallKMeans.java:192)
> 	at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.getBestCentroids(StreamingKMeansReducer.java:107)
> 	at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.reduce(StreamingKMeansReducer.java:73)
> 	at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.reduce(StreamingKMeansReducer.java:37)
> 	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:177)
> 	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
> 	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
> 	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
> 14/03/20 02:42:14 INFO mapred.JobClient: Job complete: job_local1374896815_0001
> 14/03/20 02:42:14 INFO mapred.JobClient: Counters: 16
> 14/03/20 02:42:14 INFO mapred.JobClient:   File Input Format Counters 
> 14/03/20 02:42:14 INFO mapred.JobClient:     Bytes Read=17156391
> 14/03/20 02:42:14 INFO mapred.JobClient:   FileSystemCounters
> 14/03/20 02:42:14 INFO mapred.JobClient:     FILE_BYTES_READ=41925624
> 14/03/20 02:42:14 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=25974741
> 14/03/20 02:42:14 INFO mapred.JobClient:   Map-Reduce Framework
> 14/03/20 02:42:14 INFO mapred.JobClient:     Map output materialized bytes=956293
> 14/03/20 02:42:14 INFO mapred.JobClient:     Map input records=21578
> 14/03/20 02:42:14 INFO mapred.JobClient:     Reduce shuffle bytes=0
> 14/03/20 02:42:14 INFO mapred.JobClient:     Spilled Records=282
> 14/03/20 02:42:14 INFO mapred.JobClient:     Map output bytes=1788012
> 14/03/20 02:42:14 INFO mapred.JobClient:     Total committed heap usage (bytes)=217214976
> 14/03/20 02:42:14 INFO mapred.JobClient:     Combine input records=0
> 14/03/20 02:42:14 INFO mapred.JobClient:     SPLIT_RAW_BYTES=163
> 14/03/20 02:42:14 INFO mapred.JobClient:     Reduce input records=0
> 14/03/20 02:42:14 INFO mapred.JobClient:     Reduce input groups=0
> 14/03/20 02:42:14 INFO mapred.JobClient:     Combine output records=0
> 14/03/20 02:42:14 INFO mapred.JobClient:     Reduce output records=0
> 14/03/20 02:42:14 INFO mapred.JobClient:     Map output records=282
> 14/03/20 02:42:14 INFO driver.MahoutDriver: Program took 506269 ms (Minutes: 8.437816666666667)
> {Code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Mime
View raw message