mahout-user mailing list archives

From Jeff Eastman <jeast...@Narus.com>
Subject RE: Memory Issue with KMeans clustering
Date Fri, 04 Feb 2011 16:42:40 GMT
That's really the big challenge in using k-means (and probably any of the other clustering
algorithms too) for text clustering: the centroids tend to become dense and the memory consumption
skyrockets. I wonder if the centroid calculation could be made smarter by setting an underflow
limit and forcing close-to-zero terms to be exactly zero? I guess the challenge would be to
dynamically select this limit. Or perhaps by implementing an approximating vector which only
retains its n most significant terms? Thin ice here...
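
For what it's worth, a rough sketch of both ideas (an underflow cutoff and a top-n
truncation) might look something like the following. This is purely hypothetical
illustration code, not anything in Mahout; the Map<Integer, Double> is just a stand-in
for a sparse term-to-weight vector.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only -- not an actual Mahout class or API.
public class CentroidPruner {

  // Idea 1: drop terms whose absolute weight falls below an underflow limit.
  public static Map<Integer, Double> dropNearZero(Map<Integer, Double> centroid,
                                                  double epsilon) {
    Map<Integer, Double> pruned = new HashMap<>();
    for (Map.Entry<Integer, Double> e : centroid.entrySet()) {
      if (Math.abs(e.getValue()) >= epsilon) {
        pruned.put(e.getKey(), e.getValue());
      }
    }
    return pruned;
  }

  // Idea 2: keep only the n terms with the largest absolute weight.
  public static Map<Integer, Double> retainTopN(Map<Integer, Double> centroid, int n) {
    if (centroid.size() <= n) {
      return centroid;
    }
    List<Map.Entry<Integer, Double>> entries = new ArrayList<>(centroid.entrySet());
    // Sort by descending absolute weight, then keep the first n entries.
    entries.sort(Comparator.comparingDouble(
        (Map.Entry<Integer, Double> e) -> -Math.abs(e.getValue())));
    Map<Integer, Double> pruned = new HashMap<>(n);
    for (Map.Entry<Integer, Double> e : entries.subList(0, n)) {
      pruned.put(e.getKey(), e.getValue());
    }
    return pruned;
  }
}

Either way the open question from above remains: how to pick the epsilon or n
dynamically without distorting the cluster assignments too much.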

-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com] 
Sent: Friday, February 04, 2011 7:54 AM
To: user@mahout.apache.org
Subject: Re: Memory Issue with KMeans clustering

5000 x 6838856 x 8 bytes = 273GB of memory just for the centroids (which will
tend to become dense).

I recommend you decrease your input dimensionality to 10^5 - 10^6.  This
could decrease your memory needs to 4GB at the low end.

What kind of input do you have?
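
For reference, here is the back-of-the-envelope arithmetic above as a small snippet
you can plug your own numbers into (class and variable names are made up for
illustration; it just assumes one 8-byte double per cluster/term pair once the
centroids go dense):

// Hypothetical illustration: rough memory estimate for dense centroids.
public class CentroidMemoryEstimate {
  public static void main(String[] args) {
    long clusters = 5000L;
    long dimensionality = 6838856L;   // reported input dimension
    long bytesPerDouble = 8L;
    long bytes = clusters * dimensionality * bytesPerDouble;
    System.out.printf("~%.1f GB for dense centroids%n", bytes / 1e9);
    // With the dimensionality reduced to 1e5: 5000 * 1e5 * 8 bytes = 4 GB.
  }
}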

On Fri, Feb 4, 2011 at 7:50 AM, james q <james.quacinella@gmail.com> wrote:

> I think the job had 5000 - 6000 clusters. The input (sparse) vectors had a
> dimension of 6838856.
>
> -- james
>
> On Fri, Feb 4, 2011 at 1:55 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
>
> > How many clusters?
> >
> > How large is the dimension of your input data?
> >
> > On Thu, Feb 3, 2011 at 9:05 PM, james q <james.quacinella@gmail.com>
> > wrote:
> >
> > > Hello,
> > >
> > > New user to mahout and hadoop here. Isabel Drost suggested to a
> > > colleague that I should post to the mahout user list, as I am having
> > > some general difficulties with memory consumption and KMeans
> > > clustering.
> > >
> > > So a general question first and foremost: what determines how much
> > > memory a map task consumes during a KMeans clustering job? Increasing
> > > the number of map tasks by adjusting dfs.block.size and
> > > mapred.max.split.size doesn't seem to make each map task consume less
> > > memory, or at least not by a very noticeable amount. I figured that
> > > with more map tasks, each individual map task would evaluate fewer
> > > input keys and hence consume less memory. Is there any way to predict
> > > the memory usage of map tasks in KMeans?
> > >
> > > The cluster I am running consists of 10 machines, each with 8 cores
> > > and 68G of RAM. I've configured the cluster so that each machine runs
> > > at most 7 map or reduce tasks. I set the map and reduce tasks to have
> > > virtually no limit on memory consumption ... so with 7 processes each,
> > > at around 9 - 10G per process, the machines will crap out. I can
> > > reduce the number of map tasks per machine, but something tells me
> > > that that level of memory consumption is wrong.
> > >
> > > If any more information is needed to help debug this, please let me
> > > know! Thanks!
> > >
> > > -- james
> >
>