Mailing-List: contact dev-help@mahout.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@mahout.apache.org
Received-SPF: pass (athena.apache.org: domain of chalithaudara@gmail.com
 designates 209.85.128.180 as permitted sender)
MIME-Version: 1.0
In-Reply-To: <53294CC9.3010803@apache.org>
References: 
 <CA+WoyW=yx5b-0_yMg6-U-qFn7hZMtqjujCiyLTLOwoo3yjgHiw@mail.gmail.com>
	<CAPud8Tq2s4+21pGWauwMUP780LX7fp8gCBmtQuwEU1gNA8e2-g@mail.gmail.com>
	<CA+WoyW=aAc8eysrxwZFRUHLX=8vPGJxbvipXdAT5eYsB=A3OWw@mail.gmail.com>
	<CAPud8TqH_rFo=-PSu0TBZZoRCYMioPtHU6+TE_ixXM5oWeXFWQ@mail.gmail.com>
	<CA+WoyWnd+GMpU89Lka8iFC4dN6EJ1YsaaKO=pWe0cyWy3MJiAQ@mail.gmail.com>
	<CAPud8Tpw3zcE=YEX2GK+ziTywXCg=62av-8j2RzwF5-xFnDf+w@mail.gmail.com>
	<1395213229.46619.YahooMailNeo@web163502.mail.gq1.yahoo.com>
	<CA+WoyWmvdBR2MMcAAowE7KBq5+ubrUJx4y-qVFez-EX+7UbRQA@mail.gmail.com>
	<53294CC9.3010803@apache.org>
Date: Fri, 21 Mar 2014 17:28:18 +0530
Message-ID: 
 <CA+WoyWk6UH-QUzqyB1ugdYBJBgg+Q1jEbKFXmkTC5azkuh7W3A@mail.gmail.com>
Subject: Re: [GSOC 2014] Uniform API for Mahout Clustering
From: chalitha udara Perera <chalithaudara@gmail.com>
To: dev@mahout.apache.org, ssc@apache.org
Content-Type: multipart/alternative; boundary=047d7b3a8ab605688004f51c9a3b

--047d7b3a8ab605688004f51c9a3b
Content-Type: text/plain; charset=ISO-8859-1

Hi everyone,

I have submitted the proposal [1]. Thanks a lot, everyone, for your valuable
insights.
I would greatly appreciate it if you could take a few minutes to review it.

[1]
https://www.google-melange.com/gsoc/proposal/review/student/google/gsoc2014/chalitha_perera/5629499534213120

Thanks.
Chalitha


On Wed, Mar 19, 2014 at 1:22 PM, Sebastian Schelter <ssc@apache.org> wrote:

> It's not about directly porting algorithms to Spark, it's about porting
> them to a DSL that executes on top of Spark. This page has information
> about it:
>
> https://mahout.apache.org/users/sparkbindings/home.html
>
> --sebastian
>
>
> On 03/19/2014 08:43 AM, chalitha udara Perera wrote:
>
>> Thanks a lot, everyone, for your valuable insights. Since the main focus is
>> now on porting to Spark, I would be really happy to get involved with it.
>> Can you give me more information on the current progress of the porting,
>> especially regarding the clustering component?
>>
>> Regards,
>> Chalitha
>>
>>
>> On Wed, Mar 19, 2014 at 12:43 PM, Suneel Marthi <suneel_marthi@yahoo.com>
>> wrote:
>>
>>> On Wednesday, March 19, 2014 3:09 AM, Dmitriy Lyubimov <dlieu.7@gmail.com>
>>> wrote:
>>>
>>> On Tue, Mar 18, 2014 at 11:56 PM, chalitha udara Perera <
>>> chalithaudara@gmail.com> wrote:
>>>
>>>> Hi Dmitriy,
>>>>
>>>> I agree with you that I need to be more specific on this matter. Here I
>>>> was referring to some suggestions given by Suneel on Mahout 1.0 goals [1],
>>>> b and c.
>>>>
>>> He mainly speaks of test coverage there and REST exposition. What you are
>>> saying is a bit more ambitious IMO.
>>>
>>> This was long before the discussion of H2O and Spark had come up. In a
>>> later email, I had also mentioned uniform interfaces for API and porting
>>> stuff to Spark.
>>>
>>>>
>>>> For example, this is one thing I have experienced while using Mahout
>>>> clustering. I have used both simple k-means and spectral k-means. For
>>>> simple k-means the input is a sequence file containing the TF-IDF vectors
>>>> of the documents, while for spectral k-means it is a CSV file defining the
>>>> similarity matrix. It would have been much easier for users if spectral
>>>> k-means also took the TF-IDF vectors and created the similarity matrix
>>>> internally. I think that would improve the usability.
>>>>
>>> I don't think clustering is tf-idf specific. I think this is a chance for
>>> proper componentization of concerns here.
>>>
>>> Agree with Dmitriy here.
>>>
>>>
>> Totally Agree. I was just trying to give an example of uniformity.
>>
>>
>>>
>>>> And most of these algorithms are designed to run via the command line. I
>>>> know that currently a lot of programmers just use the run(String[]) method
>>>> for programming. I am not saying it is impossible to use Mahout clustering
>>>> algorithms as required, but it takes some effort. Most of the time you need
>>>> to dive into the code internals to use it properly, and most people are
>>>> not going to do that. Please provide your valuable insight on this.
>>>>
>>>> I am also really interested in the new direction Mahout is heading with
>>>> Spark, given that interest in Spark will only grow in the near future. If
>>>> you think implementing some of the clustering algorithms, for example
>>>> simple k-means, to support Spark is more important for the next release, I
>>>> would be happy to work on that.
>>>>
>>>>
>>> I would be happy to see you give a try there, too.
>>>
>>>
>>
>>>> Regards,
>>>> Chalitha
>>>>
>>>> [1]
>>>>
>>>>
>>>> http://mail-archives.apache.org/mod_mbox/mahout-dev/201402.mbox/%3C1393554632.3930.YahooMailNeo@web160202.mail.bf1.yahoo.com%3E
>>>
>>>>
>>>>
>>>>
>>>> On Wed, Mar 19, 2014 at 11:39 AM, Dmitriy Lyubimov <dlieu.7@gmail.com>
>>>> wrote:
>>>>
>>>>> I think you need to be a little bit more specific as to what you are
>>>>> proposing exactly.  I think "uniform clustering api" needs a bit of
>>>>> elaboration. I, generally, cannot say that I experienced any pain calling
>>>>> out clustering algorithms say in R as a well-documented function. In Mahout
>>>>> just doing the same was primarily a pain; but assuming one can call it with
>>>>> ease and even interactively, I can't say I experienced any major
>>>>> inconvenience with just doing this.
>>>>>
>>>>> I guess one can see that one can abstract away notions of clusters and
>>>>> clustering output, but I don't have enough experience to tell whether it is
>>>>> a good idea to cover _any_ possible clustering methodology.
>>>>>
>>>>>
>>>>> On Tue, Mar 18, 2014 at 10:50 PM, chalitha udara Perera <
>>>>> chalithaudara@gmail.com> wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> I greatly appreciate your interest in this issue. I have gone through the
>>>>>> ScalaSparkBindings document [1]. In this project my initial idea was to
>>>>>> provide a high-level API for end-user programmers, so that they have the
>>>>>> flexibility of plugging in different types of algorithms without having to
>>>>>> worry about the underlying details of different types of inputs or
>>>>>> outputs. I also consider proper test coverage for all clustering
>>>>>> algorithms a must for the 1.0 release.
>>>>>>
>>>>>> I would like to get your opinion on this, and a little more detail on the
>>>>>> current requirements for clustering would help me improve the proposal.
>>>>>>
>>>>>> Thanks,
>>>>>> Chalitha
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Mar 17, 2014 at 11:21 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Yes, there's interest.
>>>>>>> Note that we are trying to unify linear algebra primitives and
>>>>>>> optimization on Spark as well. All new linear algebra and interaction
>>>>>>> with the Spark context should probably go through this layer. This is an
>>>>>>> ongoing thing, but some stuff is working [1]
>>>>>>>
>>>>>>> [1] MAHOUT-1346 https://issues.apache.org/jira/browse/MAHOUT-1346
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Mar 17, 2014 at 10:37 AM, chalitha udara Perera <
>>>>>>> chalithaudara@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> Going through the mail thread "Mahout 1.0 goals", I found that the main
>>>>>>>> focus of Mahout is now on code refactoring and integration with Spark
>>>>>>>> rather than on implementing new algorithms. Recently I used Mahout to
>>>>>>>> implement a document clustering module for a Content Management System.
>>>>>>>>
>>>>>>>> To be honest, we had some problems with the lack of uniformity among
>>>>>>>> different clustering algorithms. For example, simple k-means takes as
>>>>>>>> input a sequence file with document TF-IDF vectors, while spectral
>>>>>>>> k-means takes a CSV file that defines the similarity matrix.
>>>>>>>>
>>>>>>>> I think if we can provide a uniform clustering API as mentioned in the
>>>>>>>> 1.0 goals, it would be very useful for end-user developers.
>>>>>>>>
>>>>>>>> I would like to proceed with this idea as my GSOC 2014 project. Please
>>>>>>>> let me know if you are interested in this project.
>>>>>>>> --
>>>>>>>> J.M Chalitha Udara Perera
>>>>>>>>
>>>>>>>> *Department of Computer Science and Engineering,*
>>>>>>>> *University of Moratuwa,*
>>>>>>>> *Sri Lanka*
>>>>>>>>


-- 
J.M Chalitha Udara Perera

*Department of Computer Science and Engineering,*
*University of Moratuwa,*
*Sri Lanka*

--047d7b3a8ab605688004f51c9a3b--