mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Elkan triangle optimization for K-Means
Date Sat, 26 Mar 2011 08:02:31 GMT
The key thing about the Elkan trick is that clusters that don't move much
can be eliminated from consideration very quickly.  Since many cluster
motions will be quite small, this can have a much larger impact than things
like ball-trees.
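
To make the trick concrete, here is a minimal sketch (illustrative Python, not Mahout code; all names are hypothetical) of how the triangle-inequality bounds are maintained: each point keeps a lower bound on its distance to every center and an upper bound on its distance to its assigned center. When a center moves by some shift s, the triangle inequality says each lower bound can shrink by at most s and the upper bound can grow by at most s, so the bounds stay valid after a cheap adjustment. Centers that barely moved rarely invalidate the bounds, and those distance computations are skipped entirely.

```python
import numpy as np

def elkan_bounds_step(X, centers, new_centers, assign, lower, upper):
    """One illustrative bound-update and pruning pass (simplified Elkan).

    X: (n, d) points; centers/new_centers: (k, d) before/after an iteration;
    assign: (n,) current cluster index per point;
    lower: (n, k) lower bounds on point-to-center distances;
    upper: (n,) upper bound on distance to the assigned center.
    """
    # How far did each center move this iteration?
    shift = np.linalg.norm(new_centers - centers, axis=1)       # (k,)
    # Triangle inequality: loosen bounds by each center's shift.
    lower = np.maximum(lower - shift[None, :], 0.0)
    upper = upper + shift[assign]
    skipped = computed = 0
    for i in range(len(X)):
        for c in range(len(new_centers)):
            if c == assign[i]:
                continue
            if upper[i] <= lower[i, c]:
                skipped += 1      # bound proves c cannot be closer; no distance needed
                continue
            d = np.linalg.norm(X[i] - new_centers[c])
            computed += 1
            lower[i, c] = d       # tighten the bound with the exact distance
            if d < upper[i]:      # found a closer center: reassign
                assign[i] = c
                upper[i] = d
    return assign, lower, upper, skipped, computed
```

When the centers stop moving (shift near zero), every non-assigned center is pruned by its bound and the pass does no distance computations at all, which is exactly Ted's point about late iterations becoming nearly free.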

On Fri, Mar 25, 2011 at 9:23 PM, Daniel McEnnis <dmcennis@gmail.com> wrote:

> Gustavo,
>
> My experience with massive data sets is that CPU is the wrong metric -
> memory use is the key.  The challenge is to get the operation to
> complete, not how long it will take.  I'll have to read the paper
> tomorrow to get a feel for its memory footprint.  I'm off to bed.
>
> Daniel.
>
> On Fri, Mar 25, 2011 at 6:22 PM, Gustavo Enrique Salazar Torres
> <gsalazar@ime.usp.br> wrote:
> > Daniel
> >
> > I guess that Ted is right, the paper below suggests that Elkan's
> > optimization is better than kd-trees:
> >
> > http://www.siam.org/proceedings/datamining/2010/dm10_012_hamerlyg.pdf
> >
> > Do you know any benchmark on ball-trees vs elkan's optimization?
> >
> > Ted, perhaps this optimized version of k-means should be used for
> > high-dimensional data, where distance computation will never be fast.
> > What do you think?
> > Still, I will perform some tests.
> >
> > Best regards.
> > Gustavo
> >
> >
> > On Fri, Mar 25, 2011 at 5:47 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> >>
> >> Again, the problem with large data is almost always I/O, not compute.
> >>
> >> Let's take an example with 100 centroids.  Assume inputs are sparse
> >> vectors with 100 non-zero elements on average.
> >>
> >> The complete distance computation costs 100 floating point operations per
> >> vector.  Sparse vector routines can usually sustain a few hundred
> >> megaflops single-threaded and possibly a gigaflop with lots of threads.
> >> Comparison against our 100 centroids thus will take 10K flops = 10us on
> >> average.  A vector is, however, 200 x (8 + 4) > 2Kbytes long.  At
> >> 100 MB/s, this takes 20us to read, which means that the program will be
> >> I/O limited for sure on non-local data (where Hadoop has trouble
> >> saturating a 100MB/s link).  For local data, Hadoop can sometimes sustain
> >> enough I/O to saturate the CPU, but just barely.
> >>
> >> If you are running on EC2, these factors will be much worse and there
> >> will be almost no chance of saturating the CPU.
> >>
> >> Being very clever about the distance computation may cut the cost by a
> >> small factor, but it is unlikely to make even a 2:1 difference in the
> >> cost of the computation.  Even so, the cost of running each k-means
> >> iteration will be swamped by Hadoop overheads.
> >>
> >> On the other hand, building a non-Hadoop version of k-means using
> >> something like S4, Twister, Spark or HaLoop would allow speedups of as
> >> much as 20x simply by making each iteration much more efficient.
> >> Committing to map-reduce v2 would allow similar speedups.  Another way to
> >> speed things up would be to allow multiple k-means iterations to happen
> >> in one pass over the data by implementing a gossip protocol that lets
> >> mappers swap updates to centroids during a data pass.  On large data,
> >> full passes for each iteration are not very helpful, especially at the
> >> beginning of the process.
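
Ted's back-of-envelope arithmetic can be checked directly. A small sketch using his assumed figures (100 centroids, 100 stored entries per sparse vector, roughly 1 gigaflop of sparse throughput, 100 MB/s read bandwidth; his 200-entry serialized size is taken as given):

```python
# Sanity-check of the I/O-vs-compute estimate above, using the figures
# assumed in the message (not measured values).
k = 100                       # centroids
nnz = 100                     # stored entries per sparse input vector
flops_per_vector = k * nnz    # one distance per centroid: 10,000 flops
gflops = 1e9                  # optimistic multi-threaded sparse throughput
compute_us = flops_per_vector / gflops * 1e6    # compute time per vector, in us

bytes_per_entry = 8 + 4       # 8-byte value + 4-byte index
vector_bytes = 200 * bytes_per_entry            # the message's 200 x (8 + 4)
read_us = vector_bytes / 100e6 * 1e6            # read time per vector at 100 MB/s

io_bound = read_us > compute_us                 # reading dominates computing
```

With these numbers, compute works out to about 10us per vector and the read to about 24us (Ted rounds to 20us for 2 KB), so the read time exceeds the compute time and the job is I/O bound even before Hadoop overheads are counted.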
> >>
> >>
> >>
> >>
> >> On Fri, Mar 25, 2011 at 12:03 PM, Daniel McEnnis <dmcennis@gmail.com>
> >> wrote:
> >>
> >> > Dear Ted, Gustavo,
> >> >
> >> > If you don't mind me eavesdropping.... The standard way to cut down on
> >> > distance computations for a number of algorithms, including KMeans, is
> >> > to use Ball Trees.  Mapping these data structures into a map-reduce
> >> > framework will speed things up considerably.
> >> >
> >> > Sincerely,
> >> >
> >> > Daniel McEnnis.
> >> >
> >> > On Fri, Mar 25, 2011 at 12:53 PM, Ted Dunning <ted.dunning@gmail.com>
> >> > wrote:
> >> > > I think that this might make a good GSoC project.
> >> > >
> >> > > But you should do a bit more analysis to determine if this will
> >> > > actually speed things up.  There is a comment in the Elkan paper that
> >> > > if the distance computation is fast, there may not be any speedup.
> >> > > Likewise, there may be no speedup if the algorithm is I/O bound.
> >> > >
> >> > >
> >> > >
> >> > > On Fri, Mar 25, 2011 at 7:12 AM, Gustavo Enrique Salazar Torres
> >> > > <tavoaqp@gmail.com> wrote:
> >> > >
> >> > >> Hi Ted
> >> > >>
> >> > >> I finally got a stable version of this algorithm, although it's
> >> > >> still a sequential version.
> >> > >> I was thinking about your suggestion to use a side file; I will have
> >> > >> to test that option.
> >> > >>
> >> > >> Regarding storage issues, Elkan's algorithm does not use n^2 space
> >> > >> to store all distances; it just uses n*k, since it has to remember
> >> > >> lower bounds for the distances between every point and centroid.
> >> > >>
> >> > >> I also wanted to propose this algorithm as a GSoC project. What do
> >> > >> you think?
> >> > >>
> >> > >> Thanks!
> >> > >> Gustavo
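
To put the n*k figure in concrete terms, a rough memory estimate (a hypothetical helper, assuming 8-byte doubles for each bound; the extra per-point upper bound is how Elkan's paper describes it):

```python
# Illustrative footprint of Elkan's bound matrices for n points, k clusters.
def elkan_bounds_bytes(n, k):
    # k lower bounds plus 1 upper bound per point, 8 bytes each
    per_point = k * 8 + 8
    return n * per_point

mb = elkan_bounds_bytes(1_000_000, 100) / 1e6   # ~808 MB for 1M points, k=100
```

So the bounds grow linearly in n rather than quadratically, but with k = 100 they are still roughly 800 bytes per point, which is consistent with the OutOfMemory errors reported below for a single-machine implementation on large inputs.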
> >> > >>
> >> > >> On 03/06/2011 08:04 PM, Ted Dunning wrote:
> >> > >>
> >> > >> In any case, can't you have the mappers write the distances you want
> >> > >> to a side file with some indexing information based on the offset in
> >> > >> the original data file?  This would allow subsequent maps to read
> >> > >> from those side files and perform a map-side join, since the cached
> >> > >> data would (a) definitely be in the same order as the original map
> >> > >> file and (b) probably even be on the same machine.
> >> > >>
> >> > >> The serious question is whether this will actually help or not.
> >> > >> Typically, with really large computations, the key speed factor is
> >> > >> the total amount of I/O.  This optimization will speed up the
> >> > >> computations, but without storing the data in memory, the read
> >> > >> bandwidth may be low enough that increasing the amount of data you
> >> > >> read could dominate the compute savings.
> >> > >>
> >> > >> On Sun, Mar 6, 2011 at 2:58 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> >> > >>
> >> > >>> Any sort of complete distance matrix makes an algorithm unscalable
> >> > >>> because storage size is n^2 in the number of items being clustered.
> >> > >>>
> >> > >>> Is this the paper you are implicitly referring to?
> >> > >>>
> >> > >>> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.8422
> >> > >>>
> >> > >>> Do you really need to keep distances to all centers?  Or just the
> >> > >>> distance to the closest center and the distances between centers?
> >> > >>>
> >> > >>> On Sun, Mar 6, 2011 at 1:39 PM, Gustavo Enrique Salazar Torres
> >> > >>> <gsalazar@ime.usp.br> wrote:
> >> > >>>
> >> > >>>> Hi there
> >> > >>>>
> >> > >>>> I have implemented a version of K-Means using Elkan's triangle
> >> > >>>> optimization, but I'm facing OutOfMemory errors since it stores
> >> > >>>> the distances between all points and their clusters.
> >> > >>>> Is there any way to distribute this matrix over a set of Hadoop
> >> > >>>> nodes?
> >> > >>>>
> >> > >>>> Thanks.
> >> > >>>>
> >> > >>>> --
> >> > >>>> Gustavo Salazar Torres, PhD(c).
> >> > >>>> ---------------------------------
> >> > >>>> "In the simplicity of my heart I have joyfully given you
> >> > >>>> everything." Mons. Luigi Giussani
> >> > >>>>
> >> > >>>> "When describing your own work, be humble and don't use
> >> > >>>> superlatives of praise, either explicitly or implicitly, even if
> >> > >>>> you are enthusiastic." Donald Knuth
> >> > >>>>
> >> > >>>
> >> > >>>
> >> > >>
> >> > >>
> >> > >
> >> >
> >
>
