Mailing-List: contact dev-help@mahout.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@mahout.apache.org
Date: Sun, 27 Apr 2014 08:32:17 +0000 (UTC)
From: "Suneel Marthi (JIRA)" <jira@apache.org>
To: dev@mahout.apache.org
Message-ID: <JIRA.12702567.1395298041614.188087.1398587537326@arcas>
In-Reply-To: <JIRA.12702567.1395298041614@arcas>
References: <JIRA.12702567.1395298041614@arcas>
Subject: [jira] [Commented] (MAHOUT-1469) Streaming KMeans fails when
 executed in MapReduce mode and REDUCE_STREAMING_KMEANS is set to true
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/MAHOUT-1469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13982251#comment-13982251 ] 

Suneel Marthi commented on MAHOUT-1469:
---------------------------------------

This definitely needs fixing.  Present Streaming KMeans impl is just not functional otherwise as has been reported by few users over the last several months.

 Issue (1) is not an valid as per the discussion in this thread.
 Issue (3) is not valid as it doesn't make sense having -rskm flag when executed in sequential mode, but need more adequate test coverage for the sequential execution.

 Issue 4 is a corner case that was never accounted for in the impl and needs fixing. Issue 2, I am not sure. 

There's one other issue about not updating estimatedDistanceCutoff in clusterInternal() that Maxim had observed and I had long noticed (as did others on user@ before) is a choking point during execution.


> Streaming KMeans fails when executed in MapReduce mode and REDUCE_STREAMING_KMEANS is set to true
> -------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1469
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1469
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.9
>            Reporter: Suneel Marthi
>            Assignee: Suneel Marthi
>             Fix For: 1.0
>
>
> Centroids are not being generated when executed in MR mode with -rskm flag set. 
> {Code}
> 14/03/20 02:42:12 INFO mapreduce.StreamingKMeansThread: Estimated Points: 282
> 14/03/20 02:42:12 INFO mapred.JobClient:  map 100% reduce 0%
> 14/03/20 02:42:14 INFO mapreduce.StreamingKMeansReducer: Number of Centroids: 0
> 14/03/20 02:42:14 WARN mapred.LocalJobRunner: job_local1374896815_0001
> java.lang.IllegalArgumentException: Must have nonzero number of training and test vectors. Asked for %.1f %% of %d vectors for test [10.000000149011612, 0]
> 	at com.google.common.base.Preconditions.checkArgument(Preconditions.java:148)
> 	at org.apache.mahout.clustering.streaming.cluster.BallKMeans.splitTrainTest(BallKMeans.java:176)
> 	at org.apache.mahout.clustering.streaming.cluster.BallKMeans.cluster(BallKMeans.java:192)
> 	at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.getBestCentroids(StreamingKMeansReducer.java:107)
> 	at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.reduce(StreamingKMeansReducer.java:73)
> 	at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.reduce(StreamingKMeansReducer.java:37)
> 	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:177)
> 	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
> 	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
> 	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
> 14/03/20 02:42:14 INFO mapred.JobClient: Job complete: job_local1374896815_0001
> 14/03/20 02:42:14 INFO mapred.JobClient: Counters: 16
> 14/03/20 02:42:14 INFO mapred.JobClient:   File Input Format Counters 
> 14/03/20 02:42:14 INFO mapred.JobClient:     Bytes Read=17156391
> 14/03/20 02:42:14 INFO mapred.JobClient:   FileSystemCounters
> 14/03/20 02:42:14 INFO mapred.JobClient:     FILE_BYTES_READ=41925624
> 14/03/20 02:42:14 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=25974741
> 14/03/20 02:42:14 INFO mapred.JobClient:   Map-Reduce Framework
> 14/03/20 02:42:14 INFO mapred.JobClient:     Map output materialized bytes=956293
> 14/03/20 02:42:14 INFO mapred.JobClient:     Map input records=21578
> 14/03/20 02:42:14 INFO mapred.JobClient:     Reduce shuffle bytes=0
> 14/03/20 02:42:14 INFO mapred.JobClient:     Spilled Records=282
> 14/03/20 02:42:14 INFO mapred.JobClient:     Map output bytes=1788012
> 14/03/20 02:42:14 INFO mapred.JobClient:     Total committed heap usage (bytes)=217214976
> 14/03/20 02:42:14 INFO mapred.JobClient:     Combine input records=0
> 14/03/20 02:42:14 INFO mapred.JobClient:     SPLIT_RAW_BYTES=163
> 14/03/20 02:42:14 INFO mapred.JobClient:     Reduce input records=0
> 14/03/20 02:42:14 INFO mapred.JobClient:     Reduce input groups=0
> 14/03/20 02:42:14 INFO mapred.JobClient:     Combine output records=0
> 14/03/20 02:42:14 INFO mapred.JobClient:     Reduce output records=0
> 14/03/20 02:42:14 INFO mapred.JobClient:     Map output records=282
> 14/03/20 02:42:14 INFO driver.MahoutDriver: Program took 506269 ms (Minutes: 8.437816666666667)
> {Code}


--
This message was sent by Atlassian JIRA
(v6.2#6252)