hadoop-hdfs-user mailing list archives

From Josh Wills <josh.wi...@gmail.com>
Subject Re: Naïve k-means using hadoop
Date Wed, 03 Apr 2013 01:03:40 GMT
A couple of folks pointed me to this thread to ask if I had lifted the
k-means algorithm in ML from Mahout's implementation. For the record, I did
not; the implementation in ML is based on the iterative k-means|| algorithm
described in Bahmani et al. (2012):


whereas the Mahout impl (MAHOUT-1154) is based on the single-pass algorithm
described in Shindler et al. (2011):


For what it's worth, I point this out in the original blog post:


Also for what it's worth, I'm eager to try out the single-pass k-means
algorithm as soon as it's actually committed to Mahout and the 0.8 release
comes out; my primary interest is in helping people choose good values of K
building on the kind of data sketching techniques outlined in these

Submitting ML to Mahout didn't seem like a great idea, given that it would
have added a dependency on Crunch from Mahout. The Crunch project spends a
fair amount of time doing battle with dependency conflicts, and I wouldn't
want to make that situation any worse for another project, esp. by doing it
via an unsolicited and massive patch.


On Wed, Mar 27, 2013 at 10:37 AM, Mark Miller <markrmiller@gmail.com> wrote:

> On Mar 27, 2013, at 12:47 PM, Ted Dunning <tdunning@maprtech.com> wrote:
> > And, of course, due credit should be given here.  The advanced
> > clustering algorithms in Crunch were lifted from the new stuff in Mahout
> > pretty much step for step.
> >
> > The Mahout group would have loved to have contributions from the
> > Cloudera guys instead of re-implementation, but you can't legislate taste.
> >
> LOL - that's so ironic that I had to check my Calendar. Nope, not quite
> April 1st yet ;)
> Made my day.
> - Mark
