mahout-user mailing list archives

From: Jeff Eastman <jeast...@Narus.com>
Subject: RE: Memory Issue with KMeans clustering
Date: Fri, 04 Feb 2011 23:31:41 GMT
This is intriguing. Can you say a bit more about "more stages per iteration"?

-----Original Message-----
From: Severance, Steve [mailto:sseverance@ebay.com] 
Sent: Friday, February 04, 2011 2:45 PM
To: user@mahout.apache.org
Subject: RE: Memory Issue with KMeans clustering

At eBay we moved all clustering off Mahout to our own implementation. It took more stages per
iteration, but it let us use our high-dimensional feature spaces with our chosen number of targets.
We also used sparse vectors as opposed to dense vectors.
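
To put rough numbers on that: a dense vector pays 8 bytes for every
dimension, while a sparse (index, value) representation pays roughly 12
bytes per nonzero entry. A minimal sketch (the nonzero count below is a
made-up illustration, not an eBay figure):

  // Dense vs. sparse storage cost for a single vector at the
  // dimensionality from the original question (6,838,856).
  public class SparseVsDense {
      public static void main(String[] args) {
          long dim = 6838856L;    // full dimensionality
          long nonzeros = 1000L;  // hypothetical nonzeros per vector
          long dense = dim * 8L;               // ~54.7 MB per vector
          long sparse = nonzeros * (4L + 8L);  // ~12 KB per vector
          System.out.println("dense:  " + dense + " bytes");
          System.out.println("sparse: " + sparse + " bytes");
      }
  }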

Steve

-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com] 
Sent: Friday, February 04, 2011 7:54 AM
To: user@mahout.apache.org
Subject: Re: Memory Issue with KMeans clustering

5000 x 6838856 x 8 = 273GB of memory just for the centroids (which will tend
to become dense)
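
A quick back-of-the-envelope check of that figure (a hypothetical
standalone helper, not Mahout code):

  // Rough memory estimate for k dense centroids of dimension d at
  // 8 bytes per double; JVM object overhead is ignored.
  public class CentroidMemory {
      public static void main(String[] args) {
          long k = 5000L;          // number of clusters
          long d = 6838856L;       // input dimensionality
          long bytes = k * d * 8L; // 273,554,240,000 bytes
          System.out.printf("%.1f GB%n", bytes / 1e9); // ~273.6 GB
      }
  }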

I recommend you decrease your input dimensionality to 10^5 - 10^6.  That
could cut your memory needs to 4GB at the low end.
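
One common way to get into that range is the hashing trick: fold the
original feature space into a fixed number of hash buckets. A minimal
sketch in plain Java (a generic illustration, not a specific Mahout API;
DIM and the token input are assumptions):

  import java.util.HashMap;
  import java.util.Map;

  // Hashing trick: map each raw feature into one of DIM buckets so
  // centroid dimensionality stays bounded regardless of vocabulary size.
  public class HashedFeatures {
      private static final int DIM = 1 << 18; // ~262k buckets, within 10^5 - 10^6

      static Map<Integer, Double> encode(String[] tokens) {
          Map<Integer, Double> v = new HashMap<Integer, Double>();
          for (String t : tokens) {
              int idx = (t.hashCode() & Integer.MAX_VALUE) % DIM;
              Double old = v.get(idx);
              v.put(idx, old == null ? 1.0 : old + 1.0);
          }
          return v;
      }

      public static void main(String[] args) {
          System.out.println(encode(new String[] {"foo", "bar", "foo"}));
      }
  }

Colliding features share a bucket, but with 10^5 - 10^6 buckets the
distortion is usually tolerable for clustering.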

What kind of input do you have?

On Fri, Feb 4, 2011 at 7:50 AM, james q <james.quacinella@gmail.com> wrote:

> I think the job had 5000 - 6000 clusters. The input (sparse) vectors had a
> dimension of 6838856.
>
> -- james
>
> On Fri, Feb 4, 2011 at 1:55 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
>
> > How many clusters?
> >
> > How large is the dimension of your input data?
> >
> > On Thu, Feb 3, 2011 at 9:05 PM, james q <james.quacinella@gmail.com>
> > wrote:
> >
> > > Hello,
> > >
> > > New user to Mahout and Hadoop here. Isabel Drost suggested to a
> > > colleague that I should post to the mahout user list, as I am having
> > > some general difficulties with memory consumption and KMeans
> > > clustering.
> > >
> > > So a general question first and foremost: what determines how much
> > > memory a map task consumes during a KMeans clustering job? Increasing
> > > the number of map tasks by adjusting dfs.block.size and
> > > mapred.max.split.size doesn't seem to make each map task consume less
> > > memory, or at least not by a very noticeable amount. I figured that
> > > with more map tasks, each individual map task would evaluate fewer
> > > input keys and hence consume less memory. Is there any way to predict
> > > the memory usage of map tasks in KMeans?
> > >
> > > The cluster I am running consists of 10 machines, each with 8 cores
> > > and 68G of RAM. I've configured the cluster so that each machine runs
> > > at most 7 map or reduce tasks, and I set the map and reduce tasks to
> > > have virtually no limit on memory consumption ... so with 7 processes
> > > each at around 9 - 10G per process, the machines will crap out. I can
> > > reduce the number of map tasks per machine, but something tells me
> > > that that level of memory consumption is wrong.
> > >
> > > If any more information is needed to help debug this, please let me
> > > know! Thanks!
> > >
> > > -- james
> > >
> >
>