mahout-dev mailing list archives

From Suneel Marthi <>
Subject Re: Streaming KMeans clustering
Date Thu, 26 Dec 2013 18:19:30 GMT
I would push the code freeze until this is resolved (which is also the reason I had been holding off).
This is something that should have been raised for the 0.8 release, and I don't think we should
defer it to the next one.

I have heard from people outside of dev@ and user@ who tried running Streaming KMeans (from 0.8)
on their production clusters on large datasets and saw the job crash in the Reduce phase
due to OOM errors (this is with -Xmx2G).

On Thursday, December 26, 2013 12:53 PM, Isabel Drost-Fromm <> wrote:
On Thu, Dec 26, 2013 at 12:28:18AM -0800, Suneel Marthi wrote:

> It's when you increase the number of documents and the size of each
> document (add more dimensions) that you start seeing performance issues, which are:
> a) The Mappers take long to complete, and it's either the searcher.remove() or searcher.searchFirst()
> calls (will check again in my next attempt) that seem to be the bottleneck.
> b) Once the Mappers complete (after several hours), the Reducer dies with an OOM exception
> (despite having set -Xmx2G).

Given that there seem to be a couple of people experiencing issues I think it makes sense
to create a JIRA issue here to track progress - either code improvements or better documentation
on how to run this implementation.

@Suneel: Does it make sense to push code freeze to after fixing this or should this be communicated
as a known defect in the release notes?
