spark-dev mailing list archives

From RJ Nowling <rnowl...@gmail.com>
Subject Re: Contributing to MLlib: Proposal for Clustering Algorithms
Date Tue, 12 Aug 2014 18:20:40 GMT
Hi all,

I wanted to follow up.

I have a prototype for an optimized version of hierarchical k-means.  I
wanted to get some feedback on my approach.

Jeremy's implementation splits only the largest cluster in each round.  Is it
better to do it that way, or to split every cluster in half in each round?
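
To make the two options concrete, here is a minimal sketch in plain Python. It is not Jeremy's gist or MLlib code: the names `two_means` and `bisecting_kmeans`, the `strategy` values, and the farthest-point initialization are all assumptions made for illustration.

```python
def dist2(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def two_means(points, iters=10):
    """Split a cluster in two with a few Lloyd iterations (k=2).

    Assumption: centers start at the first point and the point farthest
    from it, which keeps the sketch deterministic.
    """
    centers = [points[0], max(points, key=lambda p: dist2(p, points[0]))]
    groups = ([], [])
    for _ in range(iters):
        groups = ([], [])
        for p in points:
            nearest = 0 if dist2(p, centers[0]) <= dist2(p, centers[1]) else 1
            groups[nearest].append(p)
        if not groups[0] or not groups[1]:
            break  # e.g. all points identical: nothing to split
        centers = [tuple(sum(xs) / len(g) for xs in zip(*g)) for g in groups]
    return [g for g in groups if g]


def bisecting_kmeans(points, rounds, strategy="largest"):
    """Run `rounds` rounds of divisive clustering.

    strategy="largest": split only the largest cluster each round.
    strategy="all":     split every cluster each round.
    """
    clusters = [list(points)]
    for _ in range(rounds):
        if strategy == "largest":
            to_split = {max(range(len(clusters)), key=lambda i: len(clusters[i]))}
        else:
            to_split = set(range(len(clusters)))
        next_clusters = []
        for i, cluster in enumerate(clusters):
            if i in to_split and len(cluster) >= 2:
                next_clusters.extend(two_means(cluster))
            else:
                next_clusters.append(cluster)
        clusters = next_clusters
    return clusters
```

With "largest", each round adds at most one cluster, so reaching k clusters takes about k - 1 rounds and the tree can be unbalanced. With "all", the cluster count can double each round, so the tree is balanced in depth but small clusters get split as well.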

Are there any open-source examples that are widely used in
production?

Thanks!



On Fri, Jul 18, 2014 at 8:05 AM, RJ Nowling <rnowling@gmail.com> wrote:

> Nice to meet you, Jeremy!
>
> This is great!  Hierarchical clustering was next on my list --
> currently trying to get my PR for MiniBatch KMeans accepted.
>
> If it's cool with you, I'll try converting your code to fit in with
> the existing MLlib code as you suggest. I also need to review the
> Decision Tree code (as suggested above) to see how much of that can be
> reused.
>
> Maybe I can ask you to do a code review for me when I'm done?
>
>
>
>
>
> On Thu, Jul 17, 2014 at 8:31 PM, Jeremy Freeman
> <freeman.jeremy@gmail.com> wrote:
> > Hi all,
> >
> > Cool discussion! I agree that a more standardized API for clustering, and
> > easy access to underlying routines, would be useful (we've also been
> > discussing this when trying to develop streaming clustering algorithms,
> > similar to https://github.com/apache/spark/pull/1361)
> >
> > For divisive, hierarchical clustering I implemented something awhile
> > back, here's a gist.
> >
> > https://gist.github.com/freeman-lab/5947e7c53b368fe90371
> >
> > It does bisecting k-means clustering (with k=2), with a recursive class
> > for keeping track of the tree. I also found this much better than
> > agglomerative methods (for the reasons Hector points out).
> >
> > This needs to be cleaned up, and can surely be optimized (esp. by
> > replacing the core KMeans step with existing MLlib code), but I can say
> > I was running it successfully on quite large data sets.
> >
> > RJ, depending on where you are in your progress, I'd be happy to help
> > work on this piece and / or have you use this as a jumping off point,
> > if useful.
> >
> > -- Jeremy
> >
> >
> >
> > --
> > View this message in context:
> > http://apache-spark-developers-list.1001551.n3.nabble.com/Contributing-to-MLlib-Proposal-for-Clustering-Algorithms-tp7212p7398.html
> > Sent from the Apache Spark Developers List mailing list archive at
> > Nabble.com.
>
>
>
> --
> em rnowling@gmail.com
> c 954.496.2314
>



-- 
em rnowling@gmail.com
c 954.496.2314
