mahout-dev mailing list archives

From Suneel Marthi <>
Subject Re: Streaming KMeans clustering
Date Fri, 03 Jan 2014 21:26:33 GMT
There is no combiner in the present implementation. Moreover, the code path that is executed
when the 'reduceStreamingKMeans' (-rskm) flag is set does not have adequate test coverage and needs
to be tested more extensively. Most of the issues I had been seeing were due to specifying
the -rskm flag.

Amir had provided a dataset with about 300K points. Could someone try running Streaming KMeans
on it, using both the MapReduce and sequential versions? I have had no luck with either version.
Here is the link to the dataset -

On Thursday, December 26, 2013 3:02 PM, Ted Dunning <> wrote:

On Thu, Dec 26, 2013 at 10:19 AM, Suneel Marthi <> wrote:

I heard from people outside of dev@ and user@ who tried running Streaming KMeans (from 0.8)
on their production clusters on large datasets and saw the job crash in the Reduce phase
due to OOM errors (this is with -Xmx2GB).
Excessive memory usage in reduce was a known bug that was addressed (supposedly) by using
a combiner.

This really smells like bug resurrection happened somehow.  Clearly that also means that
our unit tests are insufficient.
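
For readers unfamiliar with why a combiner would relieve memory pressure in the reduce phase, here is a minimal sketch of the general idea. It is not Mahout's code: the names `merge` and `combine`, the `cutoff` parameter, and the sample data are all hypothetical. It assumes each map task emits weighted centroids as (vector, weight) pairs and shows how a combiner-style pass could collapse nearby ones before they reach the reducer.

```python
from math import dist


def merge(a, b):
    """Merge two weighted centroids (vector, weight) into their weighted mean."""
    (va, wa), (vb, wb) = a, b
    w = wa + wb
    return tuple((x * wa + y * wb) / w for x, y in zip(va, vb)), w


def combine(centroids, cutoff):
    """Greedy combiner-style pass (illustrative only, not Mahout's algorithm).

    Each centroid is folded into the nearest kept centroid when it lies
    within `cutoff`; otherwise it is kept as a new centroid.
    """
    kept = []
    for vec, weight in centroids:
        if kept:
            i = min(range(len(kept)), key=lambda j: dist(kept[j][0], vec))
            if dist(kept[i][0], vec) <= cutoff:
                kept[i] = merge(kept[i], (vec, weight))
                continue
        kept.append((vec, weight))
    return kept


# Hypothetical output of one map task: two tight groups of unit-weight points.
map_output = [((0.0, 0.0), 1), ((0.1, 0.0), 1), ((5.0, 5.0), 1), ((5.0, 5.2), 1)]
reduced = combine(map_output, cutoff=1.0)
```

In this toy run the reducer would see fewer records than the map task emitted, while the total weight is preserved. Without such a step, every map-side centroid is shipped to the reducer, which is consistent with the OOM reports above.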