Date: Sun, 27 Apr 2014 08:32:17 +0000 (UTC)
From: "Suneel Marthi (JIRA)"
To: dev@mahout.apache.org
Reply-To: dev@mahout.apache.org
Subject: [jira] [Commented] (MAHOUT-1469) Streaming KMeans fails when executed in MapReduce mode and REDUCE_STREAMING_KMEANS is set to true

    [ https://issues.apache.org/jira/browse/MAHOUT-1469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13982251#comment-13982251 ]

Suneel Marthi commented on MAHOUT-1469:
---------------------------------------

This definitely needs fixing. The present Streaming KMeans impl is just not functional otherwise, as a few users have reported over the last several months.

Issue (1) is not valid, as per the discussion in this thread.
Issue (3) is not valid, as it doesn't make sense to have the -rskm flag when executing in sequential mode, but we need more adequate test coverage for sequential execution.
Issue (4) is a corner case that was never accounted for in the impl and needs fixing.
Issue (2), I am not sure about.

There's one other issue, about not updating estimatedDistanceCutoff in clusterInternal(), that Maxim observed and that I had long noticed (as did others on user@ before): it is a choking point during execution.

> Streaming KMeans fails when executed in MapReduce mode and REDUCE_STREAMING_KMEANS is set to true
> -------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1469
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1469
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.9
>            Reporter: Suneel Marthi
>            Assignee: Suneel Marthi
>             Fix For: 1.0
>
>
> Centroids are not being generated when executed in MR mode with -rskm flag set.
> {Code}
> 14/03/20 02:42:12 INFO mapreduce.StreamingKMeansThread: Estimated Points: 282
> 14/03/20 02:42:12 INFO mapred.JobClient: map 100% reduce 0%
> 14/03/20 02:42:14 INFO mapreduce.StreamingKMeansReducer: Number of Centroids: 0
> 14/03/20 02:42:14 WARN mapred.LocalJobRunner: job_local1374896815_0001
> java.lang.IllegalArgumentException: Must have nonzero number of training and test vectors. Asked for %.1f %% of %d vectors for test [10.000000149011612, 0]
>     at com.google.common.base.Preconditions.checkArgument(Preconditions.java:148)
>     at org.apache.mahout.clustering.streaming.cluster.BallKMeans.splitTrainTest(BallKMeans.java:176)
>     at org.apache.mahout.clustering.streaming.cluster.BallKMeans.cluster(BallKMeans.java:192)
>     at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.getBestCentroids(StreamingKMeansReducer.java:107)
>     at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.reduce(StreamingKMeansReducer.java:73)
>     at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.reduce(StreamingKMeansReducer.java:37)
>     at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:177)
>     at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
> 14/03/20 02:42:14 INFO mapred.JobClient: Job complete: job_local1374896815_0001
> 14/03/20 02:42:14 INFO mapred.JobClient: Counters: 16
> 14/03/20 02:42:14 INFO mapred.JobClient: File Input Format Counters
> 14/03/20 02:42:14 INFO mapred.JobClient: Bytes Read=17156391
> 14/03/20 02:42:14 INFO mapred.JobClient: FileSystemCounters
> 14/03/20 02:42:14 INFO mapred.JobClient: FILE_BYTES_READ=41925624
> 14/03/20 02:42:14 INFO mapred.JobClient: FILE_BYTES_WRITTEN=25974741
> 14/03/20 02:42:14 INFO mapred.JobClient: Map-Reduce Framework
> 14/03/20 02:42:14 INFO mapred.JobClient: Map output materialized bytes=956293
> 14/03/20 02:42:14 INFO mapred.JobClient: Map input records=21578
> 14/03/20 02:42:14 INFO mapred.JobClient: Reduce shuffle bytes=0
> 14/03/20 02:42:14 INFO mapred.JobClient: Spilled Records=282
> 14/03/20 02:42:14 INFO mapred.JobClient: Map output bytes=1788012
> 14/03/20 02:42:14 INFO mapred.JobClient: Total committed heap usage (bytes)=217214976
> 14/03/20 02:42:14 INFO mapred.JobClient: Combine input records=0
> 14/03/20 02:42:14 INFO mapred.JobClient: SPLIT_RAW_BYTES=163
> 14/03/20 02:42:14 INFO mapred.JobClient: Reduce input records=0
> 14/03/20 02:42:14 INFO mapred.JobClient: Reduce input groups=0
> 14/03/20 02:42:14 INFO mapred.JobClient: Combine output records=0
> 14/03/20 02:42:14 INFO mapred.JobClient: Reduce output records=0
> 14/03/20 02:42:14 INFO mapred.JobClient: Map output records=282
> 14/03/20 02:42:14 INFO driver.MahoutDriver: Program took 506269 ms (Minutes: 8.437816666666667)
> {Code}

--
This message was sent by Atlassian JIRA
(v6.2#6252)
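
A side note on the quoted exception text: the literal "%.1f %% of %d" followed by "[10.000000149011612, 0]" is what one would expect from Guava's Preconditions.checkArgument, whose message templates substitute only %s placeholders and append any unmatched arguments in square brackets. The sketch below is not Mahout's code; the class and method names (SplitGuardSketch, describeSplit, checkSplittable) are hypothetical, and it only illustrates how a guard of this kind could reject an empty centroid list with a fully formatted message, assuming the split size is computed as a truncated fraction of the input size.

```java
import java.util.Locale;

// Hypothetical illustration only; none of these names exist in Mahout.
final class SplitGuardSketch {

    // Builds the detail message with String.format, which understands
    // %.1f, %% and %d (Guava's Preconditions templates only handle %s).
    static String describeSplit(double testProbability, int numVectors) {
        return String.format(Locale.ROOT,
            "Asked for %.1f %% of %d vectors for test",
            testProbability * 100, numVectors);
    }

    // Assumed rule: the test set is a truncated fraction of the input, and
    // both the test set and the training set must end up non-empty.
    static void checkSplittable(int numVectors, double testProbability) {
        int numTest = (int) (testProbability * numVectors);
        if (numTest <= 0 || numTest >= numVectors) {
            throw new IllegalArgumentException(
                "Must have nonzero number of training and test vectors. "
                    + describeSplit(testProbability, numVectors));
        }
    }

    public static void main(String[] args) {
        try {
            // Zero input vectors, as in the log ("Number of Centroids: 0").
            checkSplittable(0, 0.1);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

A check like this only makes the failure readable; it does not address why the reducer receives zero centroids in the first place, which is the actual subject of this issue.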