mahout-user mailing list archives

From: Ted Dunning <ted.dunn...@gmail.com>
Subject: Re: Memory Issue with KMeans clustering
Date: Mon, 07 Feb 2011 19:17:41 GMT
The problem is that the centroids are the average of many documents.  This
means that the number of non-zero elements in each centroid vector increases
as the number of documents increases.

This can be handled in a few ways:

- do the averaging in a sparsity-preserving way.  LLR is one such animal.
It is probably possible to do an L_1 regularized centroid as well (but I
would have to think that through for a while).

- use fixed-size vectors, as with hashed encodings (see the sketch below).
Then we don't care (as much) that the centroids are dense.
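
As a rough illustration of the second option, here is a minimal hashed-encoding
sketch in plain Java. It is not Mahout's actual encoder API (Mahout ships its
own encoders for this); the 2^18 dimension and the toy tokenizer are arbitrary
choices made for the example. The point is only that every term lands in one of
a fixed number of buckets, so the worst-case size of a centroid is bounded
before the job starts.

// Minimal sketch of hashed encoding (the "hashing trick"): every token is
// mapped into a fixed number of buckets, so vectors, and therefore the
// centroids averaged from them, can never have more than DIM entries.
// Illustration only, not Mahout's encoder implementation.
public class HashedEncodingSketch {

  private static final int DIM = 1 << 18;  // 262,144 buckets, an arbitrary choice

  // Encode a document as a fixed-size term-frequency vector.
  static double[] encode(String text) {
    double[] v = new double[DIM];
    for (String token : text.toLowerCase().split("\\W+")) {
      if (token.isEmpty()) {
        continue;
      }
      int bucket = token.hashCode() & (DIM - 1);  // DIM is a power of two
      v[bucket] += 1.0;                           // colliding terms share a bucket
    }
    return v;
  }

  public static void main(String[] args) {
    double[] a = encode("memory issue with kmeans clustering");
    double[] b = encode("the centroids tend to become dense");
    double[] centroid = new double[DIM];
    for (int i = 0; i < DIM; i++) {
      centroid[i] = (a[i] + b[i]) / 2.0;  // averaging cannot grow past DIM entries
    }
    System.out.println("centroid dimension = " + centroid.length);
  }
}

With a fixed dimensionality like this, 5000 dense centroids of 2^18 doubles need
on the order of 10 GB, rather than the hundreds of gigabytes in the estimate
quoted below.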

On Mon, Feb 7, 2011 at 11:05 AM, Robin Anil <robin.anil@gmail.com> wrote:

> We can prolly find the nearest centroid, instead of averaging it out. This
> way the centroid vector won't grow big? What do you think about that, Ted,
> Jeff?
>
> On Fri, Feb 4, 2011 at 9:23 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>
> > 5000 x 6838856 x 8 = 273GB of memory just for the centroids (which will
> > tend to become dense).
> >
> > I recommend you decrease your input dimensionality to 10^5 - 10^6. This
> > could decrease your memory needs to 4GB at the low end.
> >
> > What kind of input do you have?
> >
> > On Fri, Feb 4, 2011 at 7:50 AM, james q <james.quacinella@gmail.com> wrote:
> >
> > > I think the job had 5000 - 6000 clusters. The input (sparse) vectors
> > > had a dimension of 6838856.
> > >
> > > -- james
> > >
> > > On Fri, Feb 4, 2011 at 1:55 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
> > >
> > > > How many clusters?
> > > >
> > > > How large is the dimension of your input data?
> > > >
> > > > On Thu, Feb 3, 2011 at 9:05 PM, james q <james.quacinella@gmail.com> wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > New user to mahout and hadoop here. Isabel Drost suggested to a
> > > > > colleague I should post to the mahout user list, as I am having some
> > > > > general difficulties with memory consumption and KMeans clustering.
> > > > >
> > > > > So a general question first and foremost: what determines how much
> > > > > memory a map task consumes during a KMeans clustering job? Increasing
> > > > > the number of map tasks by adjusting dfs.block.size and
> > > > > mapred.max.split.size doesn't seem to make the map tasks consume less
> > > > > memory, or at least not by a very noticeable amount. I figured that if
> > > > > there are more map tasks, each individual map task evaluates fewer
> > > > > input keys and hence there would be less memory consumption. Is there
> > > > > any way to predict the memory usage of map tasks in KMeans?
> > > > >
> > > > > The cluster I am running consists of 10 machines, each with 8 cores
> > > > > and 68G of RAM. I've configured the cluster to have each machine, at
> > > > > maximum, run 7 map or reduce tasks. I set the map and reduce tasks to
> > > > > have virtually no limit on memory consumption ... so with 7 processes
> > > > > each, at around 9 - 10G per process, the machines will crap out. I can
> > > > > reduce the number of map tasks per machine, but something tells me
> > > > > that that level of memory consumption is wrong.
> > > > >
> > > > > If any more information is needed to help debug this, please let me
> > > > > know! Thanks!
> > > > >
> > > > > -- james
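
For reference, here is the arithmetic behind the 273GB figure in the quoted
thread, written out as a small Java check. It assumes dense double-precision
storage (8 bytes per element), as the quoted estimate does; the reduced
dimensionality of 10^5 is the low end of the suggested 10^5 - 10^6 range.

// Back-of-the-envelope check of the centroid memory estimate quoted above.
// Assumes one 8-byte double per element, i.e. dense centroid storage.
public class CentroidMemoryEstimate {
  public static void main(String[] args) {
    long clusters = 5000L;           // clusters reported in the thread
    long originalDim = 6838856L;     // input dimensionality reported in the thread
    long reducedDim = 100000L;       // low end of the suggested 10^5 - 10^6 range
    long bytesPerElement = 8L;

    long originalBytes = clusters * originalDim * bytesPerElement;
    long reducedBytes = clusters * reducedDim * bytesPerElement;

    System.out.printf("original: %.1f GB%n", originalBytes / 1e9);  // about 273.6 GB
    System.out.printf("reduced:  %.1f GB%n", reducedBytes / 1e9);   // about 4.0 GB
  }
}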
