mahout-dev mailing list archives

From "Maxim Arap (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAHOUT-1469) Streaming KMeans fails when executed in MapReduce mode and REDUCE_STREAMING_KMEANS is set to true
Date Sun, 27 Apr 2014 17:33:16 GMT

    [ https://issues.apache.org/jira/browse/MAHOUT-1469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13982401#comment-13982401 ]

Maxim Arap commented on MAHOUT-1469:
------------------------------------

Suneel: I modified clusterInternal() in my local copy of the trunk so that distanceCutoff
is updated every time numClusters is updated (namely, distanceCutoff = distanceCutoff *
numClustersOld / numClusters). I compiled and ran the new version in pseudo-distributed MapReduce
mode on my laptop against the Reuters dataset, the purpose being to compare run times with
and without the distanceCutoff tweak.
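Concretely, here is the update rule I used, pulled out as a standalone method (an illustrative
sketch, not the trunk code itself; the method name and the values in main() are mine):

{code}
/** Illustrative sketch of the tweaked distanceCutoff update (not the trunk code). */
public class DistanceCutoffTweak {

  /** Rescale the cutoff by the ratio of old to new cluster counts. */
  static double updateDistanceCutoff(double distanceCutoff, int numClustersOld, int numClusters) {
    // Multiply before dividing so the integer ratio is not truncated to zero.
    return distanceCutoff * numClustersOld / numClusters;
  }

  public static void main(String[] args) {
    double cutoff = 1.0e-6;   // hypothetical starting cutoff, for illustration
    // As numClusters grows (e.g. 100 -> 215, as in the run below), the cutoff
    // shrinks proportionally instead of following the original growth schedule.
    System.out.println(updateDistanceCutoff(cutoff, 100, 215));
  }
}
{code}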

For the Reuters dataset, if the final number of clusters is 10, the estimated number of clusters
for the streaming step is about 100 (the Reuters dataset has about 21,600 documents; hence
with k = 10 and n = 21,600, k log n is about 100; a quick check of that arithmetic is sketched below).
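In case anyone wants to verify the estimate, here is the arithmetic as a throwaway snippet
(plain Java, natural log; nothing Mahout-specific, and the class name is mine):

{code}
public class ClusterEstimate {
  public static void main(String[] args) {
    int k = 10;                          // requested final number of clusters
    int n = 21600;                       // approximate number of Reuters documents
    double estimate = k * Math.log(n);   // natural log, per the k log n heuristic
    System.out.println(estimate);        // prints ~99.8, i.e. about 100
  }
}
{code}

Here's the outcome of my experiment: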

Original distanceCutoff updating: 

Initial numClusters: 100 
Final numClusters: 215
Run time in minutes: 8.00858

Tweaked distanceCutoff updating:

Initial numClusters: 100 
Final numClusters: 269
Run time in minutes: 8.12646

What if we give the programs an underestimated initial numClusters? Not much changes: 

Original distanceCutoff updating: 
Initial numClusters: 25
Final numClusters: 271
Run time in minutes: 7.5418

Tweaked distanceCutoff updating:
Initial numClusters: 25
Final numClusters: 307
Run time in minutes: 7.77573

I ran the analogous comparison in sequential mode, and the situation is similar.

My experiment suggests that the bottleneck is not the distanceCutoff update rule. Harry
Lang and I suspect that it lies instead in the parallelization scheme (i.e., StreamingKMeansMapper.java
and StreamingKMeansReducer.java); a toy model of that scheme is sketched below.
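
For anyone following along, here is a toy model of the two-phase scheme in question
(illustration only; the class and method names are mine, not the actual mapper/reducer code).
Each "mapper" compresses its input split into weighted intermediate centroids with a greedy
streaming pass, and a single "reducer" then re-clusters everything:

{code}
import java.util.ArrayList;
import java.util.List;

/** Toy model of the map-side streaming pass (illustration only). */
public class TwoPhaseSketch {

  /**
   * Greedy streaming pass over one input split: a point farther than
   * cutoff from every existing centroid starts a new centroid; otherwise
   * it is folded into the nearest one as a weighted running mean.
   */
  static List<double[]> mapPhase(List<double[]> split, double cutoff) {
    List<double[]> centroids = new ArrayList<>();
    List<Integer> weights = new ArrayList<>();
    for (double[] p : split) {
      int nearest = -1;
      double best = Double.MAX_VALUE;
      for (int i = 0; i < centroids.size(); i++) {
        double d = distance(p, centroids.get(i));
        if (d < best) { best = d; nearest = i; }
      }
      if (nearest < 0 || best > cutoff) {
        centroids.add(p.clone());   // start a new intermediate centroid
        weights.add(1);
      } else {
        double[] c = centroids.get(nearest);
        int w = weights.get(nearest);
        for (int j = 0; j < c.length; j++) {
          c[j] = (c[j] * w + p[j]) / (w + 1);   // weighted running mean
        }
        weights.set(nearest, w + 1);
      }
    }
    return centroids;   // these go to the single reducer for re-clustering
  }

  static double distance(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) {
      s += (a[i] - b[i]) * (a[i] - b[i]);
    }
    return Math.sqrt(s);
  }

  public static void main(String[] args) {
    List<double[]> split = List.of(
        new double[] {0, 0}, new double[] {0.1, 0}, new double[] {5, 5});
    // Two intermediate centroids survive: one near the origin, one at (5, 5).
    System.out.println(mapPhase(split, 1.0).size());   // prints 2
  }
}
{code}

In the real job, StreamingKMeansReducer gathers the mapper centroids and runs BallKMeans over
them to produce the final k centroids; that hand-off is also where the stack trace in the issue
description below fails when zero centroids reach the reducer.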



> Streaming KMeans fails when executed in MapReduce mode and REDUCE_STREAMING_KMEANS is set to true
> -------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1469
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1469
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.9
>            Reporter: Suneel Marthi
>            Assignee: Suneel Marthi
>             Fix For: 1.0
>
>
> Centroids are not being generated when executed in MR mode with -rskm flag set. 
> {code}
> 14/03/20 02:42:12 INFO mapreduce.StreamingKMeansThread: Estimated Points: 282
> 14/03/20 02:42:12 INFO mapred.JobClient:  map 100% reduce 0%
> 14/03/20 02:42:14 INFO mapreduce.StreamingKMeansReducer: Number of Centroids: 0
> 14/03/20 02:42:14 WARN mapred.LocalJobRunner: job_local1374896815_0001
> java.lang.IllegalArgumentException: Must have nonzero number of training and test vectors. Asked for %.1f %% of %d vectors for test [10.000000149011612, 0]
> 	at com.google.common.base.Preconditions.checkArgument(Preconditions.java:148)
> 	at org.apache.mahout.clustering.streaming.cluster.BallKMeans.splitTrainTest(BallKMeans.java:176)
> 	at org.apache.mahout.clustering.streaming.cluster.BallKMeans.cluster(BallKMeans.java:192)
> 	at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.getBestCentroids(StreamingKMeansReducer.java:107)
> 	at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.reduce(StreamingKMeansReducer.java:73)
> 	at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.reduce(StreamingKMeansReducer.java:37)
> 	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:177)
> 	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
> 	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
> 	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
> 14/03/20 02:42:14 INFO mapred.JobClient: Job complete: job_local1374896815_0001
> 14/03/20 02:42:14 INFO mapred.JobClient: Counters: 16
> 14/03/20 02:42:14 INFO mapred.JobClient:   File Input Format Counters 
> 14/03/20 02:42:14 INFO mapred.JobClient:     Bytes Read=17156391
> 14/03/20 02:42:14 INFO mapred.JobClient:   FileSystemCounters
> 14/03/20 02:42:14 INFO mapred.JobClient:     FILE_BYTES_READ=41925624
> 14/03/20 02:42:14 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=25974741
> 14/03/20 02:42:14 INFO mapred.JobClient:   Map-Reduce Framework
> 14/03/20 02:42:14 INFO mapred.JobClient:     Map output materialized bytes=956293
> 14/03/20 02:42:14 INFO mapred.JobClient:     Map input records=21578
> 14/03/20 02:42:14 INFO mapred.JobClient:     Reduce shuffle bytes=0
> 14/03/20 02:42:14 INFO mapred.JobClient:     Spilled Records=282
> 14/03/20 02:42:14 INFO mapred.JobClient:     Map output bytes=1788012
> 14/03/20 02:42:14 INFO mapred.JobClient:     Total committed heap usage (bytes)=217214976
> 14/03/20 02:42:14 INFO mapred.JobClient:     Combine input records=0
> 14/03/20 02:42:14 INFO mapred.JobClient:     SPLIT_RAW_BYTES=163
> 14/03/20 02:42:14 INFO mapred.JobClient:     Reduce input records=0
> 14/03/20 02:42:14 INFO mapred.JobClient:     Reduce input groups=0
> 14/03/20 02:42:14 INFO mapred.JobClient:     Combine output records=0
> 14/03/20 02:42:14 INFO mapred.JobClient:     Reduce output records=0
> 14/03/20 02:42:14 INFO mapred.JobClient:     Map output records=282
> 14/03/20 02:42:14 INFO driver.MahoutDriver: Program took 506269 ms (Minutes: 8.437816666666667)
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)
