From: Josh Wills
Date: Tue, 2 Apr 2013 18:03:40 -0700
Subject: Re: Naïve k-means using hadoop
To: user@hadoop.apache.org
In-Reply-To: <22C27BA5-A2E2-43BF-B1F9-6ED942A4C6CE@gmail.com>
References: <22C27BA5-A2E2-43BF-B1F9-6ED942A4C6CE@gmail.com>

A couple of folks pointed me to this thread to ask if I had lifted the k-means algorithm in ML from Mahout's implementation. For the record, I did not; the implementation in ML is based on the iterative k-means|| algorithm described in Bahmani et al. (2012):

http://arxiv.org/abs/1203.6402

whereas the Mahout impl (MAHOUT-1154) is based on the single-pass algorithm described in Shindler et al. (2011):

http://books.nips.cc/papers/files/nips24/NIPS2011_1271.pdf

For what it's worth, I point this out in the original blog post:

http://blog.cloudera.com/blog/2013/03/cloudera_ml_data_science_tools/

Also for what it's worth, I'm eager to try out the single-pass k-means algorithm as soon as it's actually committed to Mahout and the 0.8 release comes out; my primary interest is in helping people choose good values of K building on the kind of data sketching techniques outlined in these algorithms.

Submitting ML to Mahout didn't seem like a great idea, given that it would have added a dependency on Crunch from Mahout. The Crunch project spends a fair amount of time doing battle with dependency conflicts, and I wouldn't want to make that situation any worse for another project, esp. by doing it via an unsolicited and massive patch.

J

On Wed, Mar 27, 2013 at 10:37 AM, Mark Miller wrote:

> On Mar 27, 2013, at 12:47 PM, Ted Dunning wrote:
>
> > And, of course, due credit should be given here. The advanced
> > clustering algorithms in Crunch were lifted from the new stuff in
> > Mahout pretty much step for step.
> >
> > The Mahout group would have loved to have contributions from the
> > Cloudera guys instead of re-implementation, but you can't legislate
> > taste.
>
> LOL - that's so ironic that I had to check my Calendar. Nope, not quite
> April 1st yet ;)
>
> Made my day.
>
> - Mark