MIME-Version: 1.0
In-Reply-To: <53294CC9.3010803@apache.org>
References: <1395213229.46619.YahooMailNeo@web163502.mail.gq1.yahoo.com> <53294CC9.3010803@apache.org>
Date: Fri, 21 Mar 2014 17:28:18 +0530
Subject: Re: [GSOC 2014] Uniform API for Mahout Clustering
From: chalitha udara Perera
To: dev@mahout.apache.org, ssc@apache.org
Content-Type: multipart/alternative; boundary=047d7b3a8ab605688004f51c9a3b

--047d7b3a8ab605688004f51c9a3b
Content-Type: text/plain; charset=ISO-8859-1

Hi everyone,

I have submitted the proposal [1]. Thanks a lot, everyone, for the valuable
insights. I would greatly appreciate it if you could take a few minutes to
review it.

[1] https://www.google-melange.com/gsoc/proposal/review/student/google/gsoc2014/chalitha_perera/5629499534213120

Thanks.
Chalitha

On Wed, Mar 19, 2014 at 1:22 PM, Sebastian Schelter wrote:

> It's not about directly porting algorithms to Spark, it's about porting
> them to a DSL that executes on top of Spark. This page has information
> about it:
>
> https://mahout.apache.org/users/sparkbindings/home.html
>
> --sebastian
>
>
> On 03/19/2014 08:43 AM, chalitha udara Perera wrote:
>
>> Thanks a lot, everyone, for the valuable insights. Since the main focus is
>> now on porting to Spark, I would be really happy to get involved with it.
>> Can you give me more information on the current progress with porting,
>> especially regarding the clustering component?
>>
>> Regards,
>> Chalitha
>>
>>
>> On Wed, Mar 19, 2014 at 12:43 PM, Suneel Marthi wrote:
>>
>>> On Wednesday, March 19, 2014 3:09 AM, Dmitriy Lyubimov
>>> <dlieu.7@gmail.com> wrote:
>>>
>>> On Tue, Mar 18, 2014 at 11:56 PM, chalitha udara Perera
>>> <chalithaudara@gmail.com> wrote:
>>>
>>>> Hi Dmitriy,
>>>>
>>>> I agree with you that I need to be more specific on this matter. Here I
>>>> was referring to some suggestions given by Suneel on Mahout 1.0 goals
>>>> [1], b and c.
>>>
>>> He mainly speaks of test coverage there and REST exposition. What you
>>> are saying is a bit more ambitious IMO.
>>>
>>> This was long before the discussion of H2O and Spark had come up. In a
>>> later email, I had also mentioned uniform interfaces for API and porting
>>> stuff to Spark.
>>>
>>>> For example, this is one thing I have experienced while using Mahout
>>>> clustering. I have used both simple kmeans and spectral kmeans. For
>>>> simple kmeans the input is a sequence file containing the TF-IDF
>>>> vectors of the documents, while for spectral kmeans it is a CSV file
>>>> defining the similarity matrix. It would be much easier for users if
>>>> spectral kmeans also took the TF-IDF vectors and created the
>>>> similarity matrix internally. I think that would improve usability.
>>>
>>> I don't think clustering is tf-idf specific. I think this is a chance for
>>> proper componentization of concerns here.
>>>
>>> Agree with Dmitriy here.
>>>
>>
>> Totally agree. I was just trying to give an example of uniformity.
>>
>>>> And most of these algorithms are designed to run via the command line. I
>>>> know that currently a lot of programmers just use the run(String[])
>>>> method for programming. I am not saying it is impossible to use Mahout
>>>> clustering algorithms as required,
>>>> but it takes some effort; most of the time you need to dive into the
>>>> code internals to use it properly, and most people are not going to
>>>> do that. Please provide your valuable insight on this.
>>>>
>>>> I am also really interested in the new direction Mahout is heading with
>>>> Spark, given that interest in Spark will only grow in the near future.
>>>> If you think implementing some of the clustering algorithms, for example
>>>> simple kmeans, to support Spark is more important for the next release,
>>>> I would be happy to work on that.
>>>
>>> I would be happy to see you give it a try there, too.
>>>
>>>> Regards,
>>>> Chalitha
>>>>
>>>> [1] http://mail-archives.apache.org/mod_mbox/mahout-dev/201402.mbox/%3C1393554632.3930.YahooMailNeo@web160202.mail.bf1.yahoo.com%3E
>>>>
>>>>
>>>> On Wed, Mar 19, 2014 at 11:39 AM, Dmitriy Lyubimov wrote:
>>>>
>>>>> I think you need to be a little bit more specific as to what you are
>>>>> proposing exactly. I think "uniform clustering api" needs a bit of
>>>>> elaboration. I, generally, cannot say that I experienced any pain
>>>>> calling out clustering algorithms, say in R, as a well-documented
>>>>> function. In Mahout just doing the same was primarily a pain; but
>>>>> assuming one can call it with ease and even interactively, I can't say
>>>>> I experienced any major inconvenience with just doing this.
>>>>>
>>>>> I guess one can see that one can abstract away notions of clusters and
>>>>> clustering output, but I don't have enough experience to tell whether
>>>>> it is a good idea to cover _any_ possible clustering methodology.
>>>>>
>>>>>
>>>>> On Tue, Mar 18, 2014 at 10:50 PM, chalitha udara Perera
>>>>> <chalithaudara@gmail.com> wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> Greatly appreciate your interest in this issue.
>>>>>> I have gone through the document ScalaSparkBindings [1]. In this
>>>>>> project my initial idea was to provide a high-level API for end-user
>>>>>> programmers so that they have the flexibility of plugging in different
>>>>>> types of algorithms without worrying about the underlying details of
>>>>>> different types of inputs or outputs. Also, I consider providing proper
>>>>>> test coverage for all clustering algorithms a must for the 1.0 release.
>>>>>>
>>>>>> I would like to get your opinion regarding this, and a little more
>>>>>> detail on the current requirements for clustering would help me
>>>>>> improve the proposal.
>>>>>>
>>>>>> Thanks,
>>>>>> Chalitha
>>>>>>
>>>>>>
>>>>>> On Mon, Mar 17, 2014 at 11:21 PM, Dmitriy Lyubimov
>>>>>> <dlieu.7@gmail.com> wrote:
>>>>>>
>>>>>>> Yes, there's interest.
>>>>>>> Note that we are trying to unify linear algebra primitives and
>>>>>>> optimization on Spark as well. All new linear algebra and interaction
>>>>>>> with the Spark context should probably go through this layer. This is
>>>>>>> an ongoing thing but some stuff is working [1].
>>>>>>>
>>>>>>> [1] MAHOUT-1346 https://issues.apache.org/jira/browse/MAHOUT-1346
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Mar 17, 2014 at 10:37 AM, chalitha udara Perera
>>>>>>> <chalithaudara@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> Going through the mail thread "Mahout 1.0 goals", I found that the
>>>>>>>> main focus of Mahout is now on code refactoring and integration with
>>>>>>>> Spark rather than implementing new algorithms. Recently I have used
>>>>>>>> Mahout to implement a document clustering module for a Content
>>>>>>>> Management System.
>>>>>>>>
>>>>>>>> To be honest, we had some problems with the lack of uniformity among
>>>>>>>> different clustering algorithms. For example, simple Kmeans takes as
>>>>>>>> input a sequence file with document TF-IDF vectors, while Spectral
>>>>>>>> Kmeans takes a CSV file that defines the similarity matrix.
>>>>>>>>
>>>>>>>> I think if we can provide a uniform clustering API as mentioned in the
>>>>>>>> 1.0 goals, it would be very useful for end-user developers.
>>>>>>>>
>>>>>>>> I would like to proceed with this idea as my GSOC 2014 project. Please
>>>>>>>> let me know if you are interested in this project.
>>>>>>>> --
>>>>>>>> J.M Chalitha Udara Perera
>>>>>>>>
>>>>>>>> *Department of Computer Science and Engineering,*
>>>>>>>> *University of Moratuwa,*
>>>>>>>> *Sri Lanka*
>>>>>>>>
>>>>>>
>>>>>> --
>>>>>> J.M Chalitha Udara Perera
>>>>>>
>>>>>> *Department of Computer Science and Engineering,*
>>>>>> *University of Moratuwa,*
>>>>>> *Sri Lanka*
>>>>>>
>>>>
>>>> --
>>>> J.M Chalitha Udara Perera
>>>>
>>>> *Department of Computer Science and Engineering,*
>>>> *University of Moratuwa,*
>>>> *Sri Lanka*
>>>>
>

--
J.M Chalitha Udara Perera

*Department of Computer Science and Engineering,*
*University of Moratuwa,*
*Sri Lanka*

--047d7b3a8ab605688004f51c9a3b--