Reply-To: dev@spark.apache.org
MIME-Version: 1.0
In-Reply-To: 
 <CAMAsSdJ5em5SgjszP-NiA58u_NR_tgT2PC6Wwxi4W8VX0cvVgA@mail.gmail.com>
References: 
 <CADSNrJHOT035G1xda8p9eZz-90uiWOb4CwgiY1r-d8m9V9LM+g@mail.gmail.com>
 <CAMAsSdJ5em5SgjszP-NiA58u_NR_tgT2PC6Wwxi4W8VX0cvVgA@mail.gmail.com>
From: Paul Brown <prb@mult.ifario.us>
Date: Mon, 21 Apr 2014 10:03:55 -0700
Message-ID: 
 <CACArsZ8w=mRKMwPyG9+9Kyyn-kVtOj1E2ujC6p_Uv_X-86m1Tg@mail.gmail.com>
Subject: Re: Any plans for new clustering algorithms?
To: dev@spark.apache.org
Content-Type: multipart/alternative; boundary=001a11c2fdf64694fe04f7907dfd

--001a11c2fdf64694fe04f7907dfd
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

I agree that it will be good to see more algorithms added to the MLlib
universe, although this does bring to mind a couple of comments:

- MLlib as Mahout.next would be unfortunate.  There are some gems in
Mahout, but there are also lots of rocks.  Setting a minimal bar of
working, correctly implemented, and documented requires a surprising amount
of work.

- Not getting any signal out of your data with an algorithm like K-means
implies one of the following: (1) there is no signal in your data, (2) you
should try tuning the algorithm differently, (3) you're using K-means
wrong, (4) you should try preparing the data differently, (5) all of the
above, or (6) none of the above.
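
To make point (4) concrete, here is a rough sketch in plain Python (not
MLlib code) of why data preparation matters for a distance-based method
like K-means. The data, the column meanings, and the helper names
standardize and nearest are all made up for illustration. With raw values
the large-scale column dominates Euclidean distance; after z-scoring each
column, a different point becomes the nearest neighbor.

```python
import math
import statistics


def standardize(rows):
    """Rescale each column to zero mean and unit population std dev."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    stdevs = [statistics.pstdev(c) or 1.0 for c in cols]
    return [
        tuple((x - m) / s for x, m, s in zip(row, means, stdevs))
        for row in rows
    ]


def nearest(rows, i):
    """Index of the row closest to rows[i] in Euclidean distance."""
    others = (j for j in range(len(rows)) if j != i)
    return min(others, key=lambda j: math.dist(rows[i], rows[j]))


# Hypothetical data: (income in dollars, age in years).
points = [
    (30000.0, 25.0),
    (30500.0, 60.0),
    (32000.0, 26.0),
    (60000.0, 61.0),
]

# Nearest neighbor of the first point, before and after scaling.
raw_nearest = nearest(points, 0)
scaled_nearest = nearest(standardize(points), 0)
print(raw_nearest, scaled_nearest)
```

On the raw values the first point's nearest neighbor is the one with a
similar income and a very different age; after scaling it is the one that
is close on both columns. Any clustering built on the same distance will
shift accordingly.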

My $0.02.
-- Paul


--
prb@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/


On Mon, Apr 21, 2014 at 8:58 AM, Sean Owen <sowen@cloudera.com> wrote:

> Nobody asked me, and this is a comment on a broader question, not this
> one, but:
>
> In light of a number of recent items about adding more algorithms,
> I'll say that I personally think an explosion of algorithms should
> come after the MLlib "core" is more fully baked. I'm thinking of
> finishing out the changes to vectors and matrices, for example. Things
> are going to change significantly in the short term as people use the
> algorithms and see how well the abstractions do or don't work. I've
> seen another similar project suffer mightily from too many algorithms
> too early, so maybe I'm just paranoid.
>
> Anyway, long-term, I think lots of good algorithms is a right and
> proper goal for MLlib, myself. Consistent approaches, representations
> and APIs will make or break MLlib much more than having or not having
> a particular algorithm. With the plumbing in place, writing the algo
> is the fun easy part.
> --
> Sean Owen | Director, Data Science | London
>
>
> On Mon, Apr 21, 2014 at 4:39 PM, Aliaksei Litouka
> <aliaksei.litouka@gmail.com> wrote:
> > Hi, Spark developers.
> > Are there any plans for implementing new clustering algorithms in
> > MLLib? As far as I understand, current version of Spark ships with
> > only one clustering algorithm - K-Means. I want to contribute to Spark
> > and I'm thinking of adding more clustering algorithms - maybe
> > DBSCAN<http://en.wikipedia.org/wiki/DBSCAN>.
> > I can start working on it. Does anyone want to join me?
>

--001a11c2fdf64694fe04f7907dfd--