mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Memory Issue with KMeans clustering
Date Mon, 07 Feb 2011 20:06:22 GMT
On Mon, Feb 7, 2011 at 11:35 AM, Robin Anil <robin.anil@gmail.com> wrote:

> On Tue, Feb 8, 2011 at 12:47 AM, Ted Dunning <ted.dunning@gmail.com>
> wrote:
>
> > The problem is that the centroids are the average of many documents.
>  This
> > means that the number of non-zero elements in each centroid vector
> increases
> > as the number of documents increases.
> >
> > If we approximate the centroid by the point nearest to the centroid:
> considering we have a lot of input data, the centroids would be real
> points (part of the input dataset) instead of imaginary ones (averages).
> Some loss is incurred here.
>

This also becomes much more computationally intensive because you can't use
combiners.  Averages have the nice property that partial results can be
merged; a nearest-real-point "centroid" has no such mergeable partial state.
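To make the combiner point concrete, here is a small sketch (plain Python, not Mahout code) of why averages work with combiners: a centroid update can be carried as a (sum, count) pair, and partial pairs from different mappers merge associatively, so a combiner can pre-aggregate before the shuffle.

```python
def combine(a, b):
    """Merge two partial (sum_vector, count) accumulators."""
    sum_a, n_a = a
    sum_b, n_b = b
    return ([x + y for x, y in zip(sum_a, sum_b)], n_a + n_b)

def mean(acc):
    """Finish an accumulator into a centroid."""
    s, n = acc
    return [x / n for x in s]

# Three "mapper" partials over points (1,2), (3,4), (5,6):
p1 = ([1.0, 2.0], 1)
p2 = ([3.0, 4.0], 1)
p3 = ([5.0, 6.0], 1)

# Any grouping yields the same centroid -- that is what makes
# combiners legal.  A medoid (nearest real point) cannot be
# computed from merged partials like this.
left = combine(combine(p1, p2), p3)
right = combine(p1, combine(p2, p3))
assert mean(left) == mean(right) == [3.0, 4.0]
```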


>
> Hashed encoding would be an easier solution. The same or similar loss is
> incurred here as well due to collisions.
>
>
Actually not.  If you have multiple probes, then hashed encoding is a form
of random projection and you typically will not lose any expressivity.
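A minimal sketch of the multiple-probe idea (hypothetical helper names, not Mahout's actual encoder API): each feature's weight is added at several hashed positions, so a single collision corrupts only one of the probes and the encoding behaves like a sparse random projection rather than a lossy bucketing.

```python
import hashlib

def hash_index(feature, probe, dim):
    """Deterministically hash (feature, probe) into [0, dim)."""
    h = hashlib.md5(f"{feature}:{probe}".encode()).hexdigest()
    return int(h, 16) % dim

def encode(features, dim=20, probes=2):
    """Add each feature's weight into `probes` slots of a dim-sized vector."""
    v = [0.0] * dim
    for feature, weight in features.items():
        for p in range(probes):
            v[hash_index(feature, p, dim)] += weight
    return v

doc = {"memory": 1.0, "kmeans": 2.0, "cluster": 1.0}
v = encode(doc)
# Total mass is preserved: each weight lands once per probe,
# collisions or not.
assert abs(sum(v) - 2 * sum(doc.values())) < 1e-9
```

With a single probe a collision silently merges two features; with multiple probes the features still differ in their other slots, which is why expressivity is typically retained.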
