mahout-user mailing list archives

From Robin Anil <robin.a...@gmail.com>
Subject Re: Memory Issue with KMeans clustering
Date Mon, 07 Feb 2011 19:05:23 GMT
We can probably find the nearest centroid instead of averaging it out. This
way the centroid vector won't grow big? What do you think about that, Ted, Jeff?
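
One possible reading of the suggestion above, as a rough sketch only: after
assigning points to a cluster, keep the member vector closest to the cluster
mean as the centroid, rather than the mean itself. This is plain Python, not
Mahout code. The dict-based sparse vectors and the function names
(sparse_mean, squared_distance, nearest_member) are assumptions made for
illustration.

```python
def sparse_mean(points):
    """Average sparse vectors stored as {index: value} dicts (hypothetical layout)."""
    total = {}
    for p in points:
        for i, v in p.items():
            total[i] = total.get(i, 0.0) + v
    n = len(points)
    return {i: v / n for i, v in total.items()}


def squared_distance(a, b):
    """Squared Euclidean distance between two sparse vectors."""
    keys = set(a) | set(b)
    return sum((a.get(i, 0.0) - b.get(i, 0.0)) ** 2 for i in keys)


def nearest_member(points):
    """Return the member vector closest to the cluster mean."""
    mean = sparse_mean(points)
    return min(points, key=lambda p: squared_distance(p, mean))


# Toy cluster: three sparse vectors with mostly disjoint non-zero indices.
cluster = [
    {0: 1.0, 5: 2.0},
    {0: 1.0, 9: 1.0},
    {3: 4.0},
]

mean = sparse_mean(cluster)    # non-zeros are the union of all members' indices
rep = nearest_member(cluster)  # non-zeros are those of a single member

print(len(mean), len(rep))
```

The averaged centroid accumulates every index any member touches, while the
chosen member is no denser than the input. Note that this sketch still builds
the mean temporarily for one cluster at a time; only the stored centroid
stays sparse.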

On Fri, Feb 4, 2011 at 9:23 PM, Ted Dunning <ted.dunning@gmail.com> wrote:

> 5000 x 6838856 x 8 = 273GB of memory just for the centroids (which will tend
> to become dense)
>
> I recommend you decrease your input dimensionality to 10^5 - 10^6.  This
> could decrease your memory needs to 4GB at the low end.
>
> What kind of input do you have?
>
> On Fri, Feb 4, 2011 at 7:50 AM, james q <james.quacinella@gmail.com> wrote:
>
> > I think the job had 5000 - 6000 clusters. The input (sparse) vectors had a
> > dimension of 6838856.
> >
> > -- james
> >
> > On Fri, Feb 4, 2011 at 1:55 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
> >
> > > How many clusters?
> > >
> > > How large is the dimension of your input data?
> > >
> > > On Thu, Feb 3, 2011 at 9:05 PM, james q <james.quacinella@gmail.com> wrote:
> > >
> > > > Hello,
> > > >
> > > > New user to mahout and hadoop here. Isabel Drost suggested to a colleague
> > > > I should post to the mahout user list, as I am having some general
> > > > difficulties with memory consumption and KMeans clustering.
> > > >
> > > > So a general question first and foremost: what determines how much memory
> > > > a map task consumes during a KMeans clustering job? Increasing the
> > > > number of map tasks by adjusting dfs.block.size and mapred.max.split.size
> > > > doesn't seem to make the map task consume less memory, or at least not a
> > > > very noticeable amount. I figured that if there are more map tasks, each
> > > > individual map task evaluates fewer input keys and hence there would be
> > > > less memory consumption. Is there any way to predict memory usage of map
> > > > tasks in KMeans?
> > > >
> > > > The cluster I am running consists of 10 machines, each with 8 cores and
> > > > 68G of ram. I've configured the cluster to have each machine, at maximum,
> > > > run 7 map or reduce tasks. I set the map and reduce tasks to have virtually
> > > > no limit on memory consumption ... so with 7 processes each, at around
> > > > 9 - 10G per process, the machines will crap out. I can reduce the number of
> > > > map tasks per machine, but something tells me that that level of memory
> > > > consumption is wrong.
> > > >
> > > > If any more information is needed to help debug this, please let me know!
> > > > Thanks!
> > > >
> > > > -- james
> > > >
> > >
> >
>
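
For reference, the centroid memory estimate quoted above can be reproduced
with a few lines of Python. This is a back-of-the-envelope sketch, not a
description of what Mahout actually allocates: it assumes every centroid ends
up fully dense, uses 8-byte doubles, and ignores JVM object overhead. The
function name dense_centroid_bytes is made up for this example.

```python
def dense_centroid_bytes(num_clusters, dimension, bytes_per_value=8):
    """Worst-case bytes needed to hold all centroids as dense vectors of doubles."""
    return num_clusters * dimension * bytes_per_value


GB = 10 ** 9  # decimal gigabytes, matching the figures in the thread

# Figures from the thread: 5000 clusters, 6838856 dimensions.
full = dense_centroid_bytes(5000, 6838856)

# Suggested reduced dimensionality of 10^5 to 10^6.
low = dense_centroid_bytes(5000, 10 ** 5)
high = dense_centroid_bytes(5000, 10 ** 6)

print(f"full dimensionality: {full / GB:.1f} GB")
print(f"reduced to 10^5:     {low / GB:.1f} GB")
print(f"reduced to 10^6:     {high / GB:.1f} GB")
```

The estimate depends only on the number of clusters and the dimensionality,
not on the size of the input split. That fits the observation that adding
more map tasks did not noticeably lower per-task memory.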
