Date: Tue, 8 Jul 2014 16:59:59 -0400
Subject: Re: Contributing to MLlib: Proposal for Clustering Algorithms
From: RJ Nowling
To: dev@spark.apache.org

The scikit-learn implementation may be of interest:

http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Ward.html#sklearn.cluster.Ward

It's a bottom-up approach. At each step, the pair of clusters to merge is
chosen to minimize the increase in within-cluster variance.

Their code is under a BSD license, so it can be used as a template.

Is something like that what you were thinking, Hector?

On Tue, Jul 8, 2014 at 4:50 PM, Dmitriy Lyubimov wrote:

> sure. more interesting problem here is choosing k at each level. Kernel
> methods seem to be most promising.
>
>
> On Tue, Jul 8, 2014 at 1:31 PM, Hector Yee wrote:
>
>> No idea, never looked it up. Always just implemented it as doing k-means
>> again on each cluster.
>>
>> FWIW standard k-means with euclidean distance has problems too with some
>> dimensionality reduction methods. Swapping out the distance metric with
>> negative dot or cosine may help.
>>
>> Other more useful clustering would be hierarchical SVD. The reason why I
>> like hierarchical clustering is it makes for faster inference especially
>> over billions of users.
>>
>>
>> On Tue, Jul 8, 2014 at 1:24 PM, Dmitriy Lyubimov wrote:
>>
>> > Hector, could you share the references for hierarchical K-means? thanks.
>> >
>> >
>> > On Tue, Jul 8, 2014 at 1:01 PM, Hector Yee wrote:
>> >
>> > > I would say for bigdata applications the most useful would be
>> > > hierarchical k-means with back tracking and the ability to support
>> > > k nearest centroids.
>> > >
>> > >
>> > > On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling wrote:
>> > >
>> > > > Hi all,
>> > > >
>> > > > MLlib currently has one clustering algorithm implementation, KMeans.
>> > > > It would benefit from having implementations of other clustering
>> > > > algorithms such as MiniBatch KMeans, Fuzzy C-Means, Hierarchical
>> > > > Clustering, and Affinity Propagation.
>> > > >
>> > > > I recently submitted a PR [1] for a MiniBatch KMeans implementation,
>> > > > and I saw an email on this list about interest in implementing Fuzzy
>> > > > C-Means.
>> > > >
>> > > > Based on Sean Owen's review of my MiniBatch KMeans code, it became
>> > > > apparent that before I implement more clustering algorithms, it would
>> > > > be useful to hammer out a framework to reduce code duplication and
>> > > > implement a consistent API.
>> > > >
>> > > > I'd like to gauge the interest and goals of the MLlib community:
>> > > >
>> > > > 1. Are you interested in having more clustering algorithms available?
>> > > >
>> > > > 2. Is the community interested in specifying a common framework?
>> > > >
>> > > > Thanks!
>> > > > RJ
>> > > >
>> > > > [1] - https://github.com/apache/spark/pull/1248
>> > > >
>> > > >
>> > > > --
>> > > > em rnowling@gmail.com
>> > > > c 954.496.2314
>> > > >
>> > >
>> > >
>> > > --
>> > > Yee Yang Li Hector
>> > > *google.com/+HectorYee *
>> > >
>> >
>>
>>
>> --
>> Yee Yang Li Hector
>> *google.com/+HectorYee *
>>

--
em rnowling@gmail.com
c 954.496.2314
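
For anyone unfamiliar with the bottom-up Ward approach mentioned at the top of this message, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions, not scikit-learn's code and not a proposed MLlib API: the names `centroid`, `ward_cost`, and `ward_cluster` are hypothetical, the pair search is a naive scan over all pairs, and the merge cost is the standard Ward criterion (the increase in total within-cluster sum of squares caused by a merge).

```python
from itertools import combinations


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def ward_cost(a, b):
    """Increase in total within-cluster sum of squares if a and b are merged.

    Standard Ward criterion: |a||b| / (|a| + |b|) * ||mean(a) - mean(b)||^2.
    """
    ca, cb = centroid(a), centroid(b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist


def ward_cluster(points, n_clusters):
    """Naive agglomerative clustering (hypothetical helper, not a library API).

    Starts with one cluster per point and repeatedly merges the pair whose
    merge has the lowest Ward cost until n_clusters remain.
    """
    clusters = [[tuple(p)] for p in points]
    while len(clusters) > n_clusters:
        # combinations() yields index pairs with i < j, so deleting j is safe.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0), (5.0, 50.0)]
    for cluster in ward_cluster(data, 3):
        print(cluster)
```

The all-pairs scan is fine for a handful of points but would not scale to the data sizes discussed in this thread; a distributed version would need a different merge strategy.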