mahout-user mailing list archives

From Robin Anil <robin.a...@gmail.com>
Subject Re: Memory Issue with KMeans clustering
Date Mon, 07 Feb 2011 19:19:29 GMT
Nearest point to the centroid instead of average of points*

On Tue, Feb 8, 2011 at 12:35 AM, Robin Anil <robin.anil@gmail.com> wrote:

> We can probably find the nearest centroid, instead of averaging it
> out. This way the centroid vector won't grow big? What do you think
> about that, Ted, Jeff?
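
[A minimal sketch of the suggestion, using Mahout's Vector and
DistanceMeasure types; the class and method names are made up for
illustration, and this is not Mahout's actual cluster-update code. The
mean is still computed, but only one transient dense vector exists at
a time, and the stored center is an actual input point, so it stays as
sparse as the data:]

    import java.util.List;
    import org.apache.mahout.common.distance.DistanceMeasure;
    import org.apache.mahout.math.Vector;

    public final class MedoidStyleUpdate {
      // Return the assigned point closest to the (transient) mean.
      // Unlike averaging, the new center is one of the input vectors,
      // so it stays sparse if the inputs are sparse.
      public static Vector chooseCenter(List<Vector> assigned,
                                        DistanceMeasure measure) {
        Vector sum = assigned.get(0).clone();
        for (int i = 1; i < assigned.size(); i++) {
          sum = sum.plus(assigned.get(i));
        }
        Vector mean = sum.divide(assigned.size());

        Vector best = null;
        double bestDist = Double.POSITIVE_INFINITY;
        for (Vector v : assigned) {
          double d = measure.distance(mean, v);
          if (d < bestDist) {
            bestDist = d;
            best = v;
          }
        }
        return best;
      }
    }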
>
>
> On Fri, Feb 4, 2011 at 9:23 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>
>> 5000 x 6838856 x 8 = 273GB of memory just for the centroids (which
>> will tend to become dense)
>>
>> I recommend you decrease your input dimensionality to 10^5 - 10^6.
>> This could decrease your memory needs to 4GB at the low end.
>>
>> What kind of input do you have?
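
[The arithmetic above checks out; a few lines reproduce it, assuming
only that centers are stored as dense 8-byte doubles:]

    public class CentroidMemoryEstimate {
      public static void main(String[] args) {
        long clusters = 5000L;
        long dims = 6838856L;        // dimensionality quoted below
        double gb = clusters * dims * 8L / 1e9;
        System.out.printf("dense centers: %.1f GB%n", gb);  // 273.6

        long reducedDims = 100000L;  // low end of 10^5 - 10^6
        double gbReduced = clusters * reducedDims * 8L / 1e9;
        System.out.printf("reduced dims:  %.1f GB%n", gbReduced);  // 4.0
      }
    }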
>>
>> On Fri, Feb 4, 2011 at 7:50 AM, james q <james.quacinella@gmail.com>
>> wrote:
>>
>> > I think the job had 5000 - 6000 clusters. The input (sparse)
>> > vectors had a dimension of 6838856.
>> >
>> > -- james
>> >
>> > On Fri, Feb 4, 2011 at 1:55 AM, Ted Dunning <ted.dunning@gmail.com>
>> > wrote:
>> >
>> > > How many clusters?
>> > >
>> > > How large is the dimension of your input data?
>> > >
>> > > On Thu, Feb 3, 2011 at 9:05 PM, james q <james.quacinella@gmail.com>
>> > > wrote:
>> > >
>> > > > Hello,
>> > > >
>> > > > New user to mahout and hadoop here. Isabel Drost suggested to
>> > > > a colleague that I should post to the mahout user list, as I
>> > > > am having some general difficulties with memory consumption
>> > > > and KMeans clustering.
>> > > >
>> > > > So a general question first and foremost: what determines how
>> > > > much memory a map task consumes during a KMeans clustering
>> > > > job? Increasing the number of map tasks by adjusting
>> > > > dfs.block.size and mapred.max.split.size doesn't seem to make
>> > > > the map task consume less memory, or at least not by a very
>> > > > noticeable amount. I figured that with more map tasks, each
>> > > > individual map task would evaluate fewer input keys and hence
>> > > > consume less memory. Is there any way to predict the memory
>> > > > usage of map tasks in KMeans?
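
[Why adjusting the split size doesn't help, per the diagnosis further
up the thread: each map task holds all k cluster centers in memory
before it reads a single record, so per-task memory is governed by
k * d, not by how the input is carved into splits. An illustrative
skeleton, not Mahout's real KMeansMapper:]

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;

    public class KMeansMapperSketch {
      private final List<Vector> centers = new ArrayList<Vector>();

      // Runs once per task, before any input: ~ k * d * 8 bytes once
      // the centers densify, regardless of how small the splits are.
      void setup(int k, int dims) {
        for (int i = 0; i < k; i++) {
          centers.add(new RandomAccessSparseVector(dims));
        }
      }

      // Runs per record: only one point from the split is in memory
      // at a time, so smaller splits cut I/O per task, not heap.
      int map(Vector point) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i < centers.size(); i++) {
          double d = centers.get(i).getDistanceSquared(point);
          if (d < bestDist) { bestDist = d; best = i; }
        }
        return best;
      }
    }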
>> > > >
>> > > > The cluster I am running consists of 10 machines, each with
>> > > > 8 cores and 68G of ram. I've configured the cluster so that
>> > > > each machine runs at most 7 map or reduce tasks. I set the map
>> > > > and reduce tasks to have virtually no limit on memory
>> > > > consumption ... so with 7 processes each, at around 9 - 10G
>> > > > per process, the machines will crap out. I can reduce the
>> > > > number of map tasks per machine, but something tells me that
>> > > > level of memory consumption is wrong.
>> > > >
>> > > > If any more information is needed to help debug this, please
>> > > > let me know! Thanks!
>> > > >
>> > > > -- james
>> > > >
>> > >
>> >
>>
>
>
