mahout-user mailing list archives

From: Ted Dunning <ted.dunn...@gmail.com>
Subject: Re: Memory Issue with KMeans clustering
Date: Fri, 04 Feb 2011 19:56:06 GMT
The problem is that any average of multiple sparse vectors is going to have
lots of non-zero values: the non-zeros of the mean are the union of the
non-zeros of all the vectors being averaged.
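
To see why, here is a minimal sketch (plain Java over index -> value maps;
the class and method names are made up, and this is not Mahout's actual
centroid code):

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class SparseMean {
      // Average sparse vectors held as index -> value maps.  The key set
      // of the result is the union of the key sets of all the inputs,
      // which is why the centroid of many sparse documents ends up dense.
      static Map<Integer, Double> average(List<Map<Integer, Double>> vs) {
        Map<Integer, Double> sum = new HashMap<Integer, Double>();
        for (Map<Integer, Double> v : vs) {
          for (Map.Entry<Integer, Double> e : v.entrySet()) {
            Double old = sum.get(e.getKey());
            sum.put(e.getKey(), (old == null ? 0.0 : old) + e.getValue());
          }
        }
        for (Map.Entry<Integer, Double> e : sum.entrySet()) {
          e.setValue(e.getValue() / vs.size());
        }
        return sum;
      }
    }

With text, the mean of a few thousand documents can easily touch most of
the vocabulary, even though each document touches only a tiny fraction.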

A model-based approach could use gradient descent with regularization to
build classifiers whose outputs then define the training data for the next
round of classifier building.  I have seen lots of over-fitting with that
kind of approach, however.  Strong regularization might help.
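
For concreteness, a minimal sketch of one such regularized update (a single
SGD step for logistic regression with an L2 penalty, plain Java; the names
are made up, and Mahout's SGD classifiers are considerably more elaborate):

    import java.util.Map;

    public class SgdStep {
      // One stochastic gradient step for logistic regression with y in
      // {0, 1}:  w_i += rate * ((y - p) * x_i - lambda * w_i).  A larger
      // lambda shrinks the weights harder, which is the lever against the
      // over-fitting mentioned above.  As a simplification, the penalty
      // is applied only to the features present in x.
      static void step(Map<Integer, Double> w, Map<Integer, Double> x,
                       int y, double rate, double lambda) {
        double margin = 0.0;
        for (Map.Entry<Integer, Double> e : x.entrySet()) {
          Double wi = w.get(e.getKey());
          if (wi != null) {
            margin += wi * e.getValue();
          }
        }
        double p = 1.0 / (1.0 + Math.exp(-margin));  // predicted probability
        for (Map.Entry<Integer, Double> e : x.entrySet()) {
          Double wi = w.get(e.getKey());
          double w0 = (wi == null) ? 0.0 : wi;
          w.put(e.getKey(),
                w0 + rate * ((y - p) * e.getValue() - lambda * w0));
        }
      }
    }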

On Fri, Feb 4, 2011 at 8:42 AM, Jeff Eastman <jeastman@narus.com> wrote:

> That's really the big challenge using kmeans (and probably any of the other
> clustering algorithms too) for text clustering: the centroids tend to become
> dense and the memory consumption skyrockets. I wonder if the centroid
> calculation could be made smarter by setting an underflow limit and forcing
> close-to-zero terms to be exactly zero? I guess the challenge would be to
> dynamically select this limit. Or, perhaps implementing an approximating
> vector which only retains its n most significant terms? Thin ice here...
>
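> A rough sketch of that top-n idea (plain Java over an index -> value map;
> the names are made up, and this is illustrative only, not existing Mahout
> code):
>
>     import java.util.Comparator;
>     import java.util.HashMap;
>     import java.util.Map;
>     import java.util.PriorityQueue;
>
>     public class TopN {
>       // Keep only the n largest-magnitude entries of a sparse vector.
>       static Map<Integer, Double> truncate(Map<Integer, Double> v, int n) {
>         // Min-heap on |value|: the smallest surviving entry sits on top.
>         PriorityQueue<Map.Entry<Integer, Double>> heap =
>             new PriorityQueue<Map.Entry<Integer, Double>>(n,
>                 new Comparator<Map.Entry<Integer, Double>>() {
>                   public int compare(Map.Entry<Integer, Double> a,
>                                      Map.Entry<Integer, Double> b) {
>                     return Double.compare(Math.abs(a.getValue()),
>                                           Math.abs(b.getValue()));
>                   }
>                 });
>         for (Map.Entry<Integer, Double> e : v.entrySet()) {
>           heap.offer(e);
>           if (heap.size() > n) {
>             heap.poll();  // evict the current smallest-magnitude entry
>           }
>         }
>         Map<Integer, Double> result = new HashMap<Integer, Double>();
>         for (Map.Entry<Integer, Double> e : heap) {
>           result.put(e.getKey(), e.getValue());
>         }
>         return result;
>       }
>     }
>
> Applied after each centroid recomputation, something like this would bound
> every centroid at n entries, at the cost of some approximation in the
> distance calculations.
>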
> -----Original Message-----
> From: Ted Dunning [mailto:ted.dunning@gmail.com]
> Sent: Friday, February 04, 2011 7:54 AM
> To: user@mahout.apache.org
> Subject: Re: Memory Issue with KMeans clustering
>
> 5000 x 6838856 x 8 = 273GB of memory just for the centroids (which will
> tend to become dense).
>
> I recommend you decrease your input dimensionality to 10^5 - 10^6.  This
> could decrease your memory needs to 4GB at the low end.
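>
> Spelled out as a throwaway check (assuming the centroids are stored as
> dense arrays of 8-byte doubles; the class name is made up):
>
>     public class CentroidMemory {
>       public static void main(String[] args) {
>         // 5000 centroids x 6,838,856 dimensions x 8 bytes per double
>         System.out.println(5000L * 6838856L * 8L);  // 273554240000 ~ 273GB
>         // the same 5000 centroids at 10^5 dimensions
>         System.out.println(5000L * 100000L * 8L);   // 4000000000 ~ 4GB
>       }
>     }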
>
> What kind of input do you have?
>
> On Fri, Feb 4, 2011 at 7:50 AM, james q <james.quacinella@gmail.com> wrote:
>
> > I think the job had 5000 - 6000 clusters. The input (sparse) vectors had
> > a dimension of 6838856.
> >
> > -- james
> >
> > On Fri, Feb 4, 2011 at 1:55 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
> >
> > > How many clusters?
> > >
> > > How large is the dimension of your input data?
> > >
> > > On Thu, Feb 3, 2011 at 9:05 PM, james q <james.quacinella@gmail.com> wrote:
> > >
> > > > Hello,
> > > >
> > > > New user to Mahout and Hadoop here. Isabel Drost suggested to a
> > > > colleague that I should post to the mahout user list, as I am having
> > > > some general difficulties with memory consumption and KMeans
> > > > clustering.
> > > >
> > > > So a general question first and foremost: what determines how much
> > > > memory a map task consumes during a KMeans clustering job? Increasing
> > > > the number of map tasks by adjusting dfs.block.size and
> > > > mapred.max.split.size doesn't seem to make each map task consume less
> > > > memory, or at least not by a very noticeable amount. I figured that
> > > > with more map tasks, each individual task evaluates fewer input keys
> > > > and hence would consume less memory. Is there any way to predict the
> > > > memory usage of map tasks in KMeans?
> > > >
> > > > The cluster I am running consists of 10 machines, each with 8 cores
> > > > and 68G of RAM. I've configured the cluster so that each machine runs
> > > > at most 7 map or reduce tasks. I set the map and reduce tasks to have
> > > > virtually no limit on memory consumption ... so with 7 processes each,
> > > > at around 9 - 10G per process, the machines will crap out. I can
> > > > reduce the number of map tasks per machine, but something tells me
> > > > that that level of memory consumption is wrong.
> > > >
> > > > If any more information is needed to help debug this, please let me
> > > > know!
> > > > Thanks!
> > > >
> > > > -- james
> > > >
> > >
> >
>
