Mailing-List: contact dev-help@spark.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@spark.apache.org
Received-SPF: pass (athena.apache.org: domain of rnowling@gmail.com designates
 74.125.82.174 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CAPud8ToohXQyYWqyL0d4k-i00fhbbjdQj2AHXJWqseigg7j2XQ@mail.gmail.com>
References: 
 <CADtDQQ+dEVcaQyZ1WXuyyijNJ+i=O3k0D+W6oWnzpPxP-oTNCw@mail.gmail.com>
	<CAPi87hdGKYsjyw2LrPXh=3RZZSbJ=cEW3TsFppWM6Jp5CnM_qQ@mail.gmail.com>
	<CAPud8To2vY5QeGffxkpP-snE4Pt-TmHAfsbeA8RQhqO98ZDBfg@mail.gmail.com>
	<CAPi87hcYBHckVkQr3+K1N7BH5f1sEUqGFpak+wuGk5Wi1agSYw@mail.gmail.com>
	<CAPud8ToohXQyYWqyL0d4k-i00fhbbjdQj2AHXJWqseigg7j2XQ@mail.gmail.com>
Date: Tue, 8 Jul 2014 16:59:59 -0400
Message-ID: 
 <CADtDQQKdJRPAxS8EJVgJbZOGa97_25GVrJ3tQpKwaOZY+q+HBw@mail.gmail.com>
Subject: Re: Contributing to MLlib: Proposal for Clustering Algorithms
From: RJ Nowling <rnowling@gmail.com>
To: dev@spark.apache.org
Content-Type: text/plain; charset=UTF-8

The scikit-learn implementation may be of interest:

http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Ward.html#sklearn.cluster.Ward

It's a bottom-up (agglomerative) approach.  At each step, the pair of
clusters chosen for merging is the one that minimizes the increase in
within-cluster variance.
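
To make the merge criterion concrete, here is a minimal pure-Python sketch
of that idea.  It is not scikit-learn's code; the function names
(ward_cost, ward_cluster) are made up for illustration, and the naive
pairwise search is only meant for tiny inputs.

```python
from itertools import combinations


def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(a), centroid(b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist


def ward_cluster(points, n_clusters):
    """Start from singletons and repeatedly merge the cheapest pair."""
    clusters = [[tuple(p)] for p in points]
    while len(clusters) > n_clusters:
        # Hypothetical naive search over all pairs of current clusters.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters
```

Calling ward_cluster([(0, 0), (0, 1), (10, 10), (10, 11)], 2) groups the two
nearby pairs together.  A distributed version would need a smarter way to
find the best pair than scanning all of them.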

Their code is under a BSD license so it can be used as a template.

Is that something like what you were thinking of, Hector?
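
For comparison, the top-down variant described further down the thread
(running k-means again on each cluster) might look roughly like the sketch
below.  This is only my reading of that description, with made-up names
(kmeans, hierarchical_kmeans) and a plain Lloyd's loop standing in for
whatever k-means implementation is actually used.  It does not address
backtracking or how to choose k at each level.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns the non-empty clusters."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: sq_dist(p, centers[i]))
            groups[nearest].append(p)
        groups = [g for g in groups if g]  # drop clusters that emptied out
        centers = [mean(g) for g in groups]
    return groups


def hierarchical_kmeans(points, k=2, max_depth=3, min_size=2):
    """Build a tree by re-running k-means inside each cluster."""
    node = {"center": mean(points), "size": len(points), "children": []}
    if max_depth > 0 and len(points) >= max(k, min_size):
        groups = kmeans(points, k)
        if len(groups) > 1:  # stop if the split did not separate anything
            node["children"] = [
                hierarchical_kmeans(g, k, max_depth - 1, min_size)
                for g in groups
            ]
    return node
```

Lookup against such a tree would walk from the root to the nearest child
center at each level, which is where the faster inference comes from.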

On Tue, Jul 8, 2014 at 4:50 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> Sure. The more interesting problem here is choosing k at each level. Kernel
> methods seem to be the most promising.
>
>
> On Tue, Jul 8, 2014 at 1:31 PM, Hector Yee <hector.yee@gmail.com> wrote:
>
>> No idea, never looked it up. Always just implemented it as doing k-means
>> again on each cluster.
>>
>> FWIW, standard k-means with Euclidean distance also has problems with some
>> dimensionality reduction methods. Swapping out the distance metric for
>> negative dot product or cosine distance may help.
>>
>> Another more useful clustering method would be hierarchical SVD. The reason
>> I like hierarchical clustering is that it makes for faster inference,
>> especially over billions of users.
>>
>>
>> On Tue, Jul 8, 2014 at 1:24 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>>
>> > Hector, could you share the references for hierarchical K-means? thanks.
>> >
>> >
>> > On Tue, Jul 8, 2014 at 1:01 PM, Hector Yee <hector.yee@gmail.com> wrote:
>> >
>> > > I would say for big data applications the most useful would be
>> > > hierarchical k-means with backtracking and the ability to support
>> > > k nearest centroids.
>> > >
>> > >
>> > > On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling <rnowling@gmail.com> wrote:
>> > >
>> > > > Hi all,
>> > > >
>> > > > MLlib currently has one clustering algorithm implementation, KMeans.
>> > > > It would benefit from having implementations of other clustering
>> > > > algorithms such as MiniBatch KMeans, Fuzzy C-Means, Hierarchical
>> > > > Clustering, and Affinity Propagation.
>> > > >
>> > > > I recently submitted a PR [1] for a MiniBatch KMeans implementation,
>> > > > and I saw an email on this list about interest in implementing Fuzzy
>> > > > C-Means.
>> > > >
>> > > > Based on Sean Owen's review of my MiniBatch KMeans code, it became
>> > > > apparent that before I implement more clustering algorithms, it would
>> > > > be useful to hammer out a framework to reduce code duplication and
>> > > > implement a consistent API.
>> > > >
>> > > > I'd like to gauge the interest and goals of the MLlib community:
>> > > >
>> > > > 1. Are you interested in having more clustering algorithms available?
>> > > >
>> > > > 2. Is the community interested in specifying a common framework?
>> > > >
>> > > > Thanks!
>> > > > RJ
>> > > >
>> > > > [1] - https://github.com/apache/spark/pull/1248
>> > > >
>> > > >
>> > > > --
>> > > > em rnowling@gmail.com
>> > > > c 954.496.2314
>> > > >
>> > >
>> > >
>> > >
>> > > --
>> > > Yee Yang Li Hector <http://google.com/+HectorYee>
>> > > *google.com/+HectorYee <http://google.com/+HectorYee>*
>> > >
>> >


-- 
em rnowling@gmail.com
c 954.496.2314