From: Paul Brown
Date: Mon, 21 Apr 2014 10:03:55 -0700
Subject: Re: Any plans for new clustering algorithms?
To: dev@spark.apache.org

I agree that it will be good to see more algorithms added to the MLlib
universe, although this does bring to mind a couple of comments:

- MLlib as Mahout.next would be unfortunate. There are some gems in
  Mahout, but there are also lots of rocks. Setting a minimal bar of
  working, correctly implemented, and documented requires a surprising
  amount of work.

- Not getting any signal out of your data with an algorithm like K-means
  implies one of the following: (1) there is no signal in your data,
  (2) you should try tuning the algorithm differently, (3) you're using
  K-means wrong, (4) you should try preparing the data differently,
  (5) all of the above, or (6) none of the above.

My $0.02.
-- Paul

prb@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/


On Mon, Apr 21, 2014 at 8:58 AM, Sean Owen wrote:

> Nobody asked me, and this is a comment on a broader question, not this
> one, but:
>
> In light of a number of recent items about adding more algorithms,
> I'll say that I personally think an explosion of algorithms should
> come after the MLlib "core" is more fully baked. I'm thinking of
> finishing out the changes to vectors and matrices, for example. Things
> are going to change significantly in the short term as people use the
> algorithms and see how well the abstractions do or don't work. I've
> seen another similar project suffer mightily from too many algorithms
> too early, so maybe I'm just paranoid.
>
> Anyway, long-term, I think lots of good algorithms is a right and
> proper goal for MLlib, myself. Consistent approaches, representations
> and APIs will make or break MLlib much more than having or not having
> a particular algorithm. With the plumbing in place, writing the algo
> is the fun easy part.
> --
> Sean Owen | Director, Data Science | London
>
>
> On Mon, Apr 21, 2014 at 4:39 PM, Aliaksei Litouka wrote:
> > Hi, Spark developers.
> > Are there any plans for implementing new clustering algorithms in
> > MLLib? As far as I understand, current version of Spark ships with
> > only one clustering algorithm - K-Means. I want to contribute to
> > Spark and I'm thinking of adding more clustering algorithms - maybe
> > DBSCAN.
> > I can start working on it. Does anyone want to join me?