Date: Mon, 15 Sep 2014 10:26:05 +0400
Subject: Re: Adding abstraction in MLlib
From: Egor Pahomov
To: Patrick Wendell
Cc: Erik Erlandson, Xiangrui Meng, Reynold Xin, Christoph Sawade,
    "dev@spark.apache.org", Joseph Bradley

It's good that Databricks is working on this issue! However, the current
process for working on it is not very clear to an outsider.

  - The last update on this ticket was August 5. If development has been
    active all this time, I am concerned that, without feedback from the
    community for so long, it could go in the wrong direction.
  - Even if it turns out to be one big patch, as soon as you introduce the
    new interfaces to the community we could start working on our pipeline
    code. We could write algorithms in the new paradigm instead of with no
    paradigm at all, as before. We could also help you port old code to
    the new paradigm.

My main point: shorter iterations with more transparency.

I think it would be a good idea to open a pull request with the code you
have so far, even if it doesn't pass tests, just so we can comment on it
before it is formalized in a design doc.

2014-09-13 0:00 GMT+04:00 Patrick Wendell:

> We typically post design docs on JIRAs before major work starts. For
> instance, pretty sure SPARK-1856 will have a design doc posted
> shortly.
>
> On Fri, Sep 12, 2014 at 12:10 PM, Erik Erlandson wrote:
> >
> > Are interface designs being captured anywhere as documents that the
> > community can follow along with as the proposals evolve?
> >
> > I've worked on other open source projects where design docs were
> > published as "living documents" (e.g. on google docs, or etherpad, but
> > the particular mechanism isn't crucial). FWIW, I found that to be a
> > good way to work in a community environment.
> >
> >
> > ----- Original Message -----
> >> Hi Egor,
> >>
> >> Thanks for the feedback! We are aware of some of the issues you
> >> mentioned and there are JIRAs created for them. Specifically, I'm
> >> pushing out the design on pipeline features and algorithm/model
> >> parameters this week. We can move our discussion to
> >> https://issues.apache.org/jira/browse/SPARK-1856 .
> >>
> >> It would be nice to make tests against interfaces. But it definitely
> >> needs more discussion before making PRs. For example, we discussed the
> >> learning interfaces in Christoph's PR
> >> (https://github.com/apache/spark/pull/2137/) but it takes time to
> >> reach a consensus, especially on interfaces. Hopefully all of us could
> >> benefit from the discussion. The best practice is to break down the
> >> proposal into small independent pieces and discuss them on the JIRA
> >> before submitting PRs.
> >>
> >> For performance tests, there is a spark-perf package
> >> (https://github.com/databricks/spark-perf) and we added performance
> >> tests for MLlib in v1.1. But definitely more work needs to be done.
> >>
> >> The dev-list may not be a good place for discussion on the design.
> >> Could you create JIRAs for each of the issues you pointed out, so that
> >> we can track the discussion on JIRA? Thanks!
> >>
> >> Best,
> >> Xiangrui
> >>
> >> On Fri, Sep 12, 2014 at 10:45 AM, Reynold Xin wrote:
> >> > Xiangrui can comment more, but I believe he and Joseph are actually
> >> > working on standardizing the interface and pipeline features for the
> >> > 1.2 release.
> >> >
> >> > On Fri, Sep 12, 2014 at 8:20 AM, Egor Pahomov wrote:
> >> >
> >> >> Some architecture suggestions on this matter -
> >> >> https://github.com/apache/spark/pull/2371
> >> >>
> >> >> 2014-09-12 16:38 GMT+04:00 Egor Pahomov:
> >> >>
> >> >> > Sorry, I miswrote - I meant the learners part of the framework -
> >> >> > models already exist.
> >> >> >
> >> >> > 2014-09-12 15:53 GMT+04:00 Christoph Sawade
> >> >> > <christoph.sawade@googlemail.com>:
> >> >> >
> >> >> >> I totally agree, and we also discovered some drawbacks with the
> >> >> >> classification model implementations that are based on GLMs:
> >> >> >>
> >> >> >> - There is no distinction between predicting scores, classes, and
> >> >> >>   calibrated scores (probabilities). For these models it is common
> >> >> >>   to have access to all of them, and the prediction function
> >> >> >>   ``predict`` should be consistent and stateless. Currently, the
> >> >> >>   score is only available after removing the threshold from the
> >> >> >>   model.
> >> >> >> - There is no distinction between multinomial and binomial
> >> >> >>   classification. For multinomial problems, it is necessary to
> >> >> >>   handle multiple weight vectors and multiple confidences.
> >> >> >> - Models are not serialisable, which makes it hard to use them in
> >> >> >>   practice.
> >> >> >>
> >> >> >> I started a pull request [1] some time ago. I would be happy to
> >> >> >> continue the discussion and clarify the interfaces, too!
> >> >> >>
> >> >> >> Cheers, Christoph
> >> >> >>
> >> >> >> [1] https://github.com/apache/spark/pull/2137/
> >> >> >>
> >> >> >> 2014-09-12 11:11 GMT+02:00 Egor Pahomov:
> >> >> >>
> >> >> >>> Here at Yandex, while implementing gradient boosting in Spark and
> >> >> >>> building our ML tool for internal use, we found the following
> >> >> >>> serious problems in MLlib:
> >> >> >>>
> >> >> >>>
> >> >> >>>    - There is no Regression/Classification model abstraction.
> >> >> >>>      We were building abstract data processing pipelines that
> >> >> >>>      should work with any regression, with the exact algorithm
> >> >> >>>      specified outside this code. There is no abstraction that
> >> >> >>>      allows me to do that. *(It's the main reason for all the
> >> >> >>>      further problems.)*
> >> >> >>>    - There is no common practice in MLlib for testing
> >> >> >>>      algorithms: every model generates its own random test data.
> >> >> >>>      There are no easily extractable test cases applicable to
> >> >> >>>      another algorithm. There are no benchmarks for comparing
> >> >> >>>      algorithms. After implementing a new algorithm, it's very
> >> >> >>>      hard to understand how it should be tested.
> >> >> >>>    - Lack of serialization testing: MLlib algorithms don't
> >> >> >>>      contain tests that check that a model works after
> >> >> >>>      serialization.
> >> >> >>>    - While implementing a new algorithm, it's hard to understand
> >> >> >>>      what API you should create and which interface to implement.
> >> >> >>>
> >> >> >>> Solving all these problems must start with creating common
> >> >> >>> interfaces for typical algorithms/models: regression,
> >> >> >>> classification, clustering, collaborative filtering.
> >> >> >>>
> >> >> >>> All main tests should be written against these interfaces, so
> >> >> >>> that when a new algorithm is implemented, all it has to do is
> >> >> >>> pass the already written tests. That would let us keep quality
> >> >> >>> manageable across the whole library.
> >> >> >>>
> >> >> >>> There should be a couple of benchmarks that give a new Spark user
> >> >> >>> a feeling for which algorithm to use.
> >> >> >>>
> >> >> >>> The test set against these abstractions should contain a
> >> >> >>> serialization test. In production, there is usually no use for a
> >> >> >>> model that can't be stored.
> >> >> >>>
> >> >> >>> As the first step of this roadmap I'd like to create a trait
> >> >> >>> RegressionModel, *ADD* methods to the current algorithms to
> >> >> >>> implement this trait, and create some tests against it. I plan
> >> >> >>> to do this next week.
> >> >> >>>
> >> >> >>> The purpose of this letter is to collect any objections to this
> >> >> >>> approach at an early stage: please give any feedback. The second
> >> >> >>> reason is to put a lock on this activity so that we don't do the
> >> >> >>> same thing twice: I'll create a pull request by the end of next
> >> >> >>> week, and any parallel development can start from there.
> >> >> >>>
> >> >> >>> --
> >> >> >>> *Sincerely yours,*
> >> >> >>> *Egor Pakhomov*
> >> >> >>> *Scala Developer, Yandex*
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> >> For additional commands, e-mail: dev-help@spark.apache.org

--
*Sincerely yours,*
*Egor Pakhomov*
*Scala Developer, Yandex*
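
For readers following the thread, the idea in the original proposal (one
common regression interface, with the main tests, including a
serialization check, written once against that interface) could look
roughly like the sketch below. It is written in Python only for brevity
and is not MLlib code: the names RegressionModel, to_json, from_json,
LinearModel, MeanModel and check_regression_contract are all hypothetical,
and JSON stands in for whatever serialization mechanism a real
implementation would use.

```python
import json
from abc import ABC, abstractmethod


class RegressionModel(ABC):
    """Hypothetical common interface; an illustration, not an MLlib API."""

    @abstractmethod
    def predict(self, features):
        """Return a float prediction for one feature vector."""

    @abstractmethod
    def to_json(self):
        """Serialize the model to a JSON string (assumed storage format)."""

    @classmethod
    @abstractmethod
    def from_json(cls, payload):
        """Rebuild a model from the string produced by to_json."""


class LinearModel(RegressionModel):
    """Toy model used only to exercise the shared checks."""

    def __init__(self, weights, intercept=0.0):
        self.weights = [float(w) for w in weights]
        self.intercept = float(intercept)

    def predict(self, features):
        return self.intercept + sum(w * x for w, x in zip(self.weights, features))

    def to_json(self):
        return json.dumps({"weights": self.weights, "intercept": self.intercept})

    @classmethod
    def from_json(cls, payload):
        data = json.loads(payload)
        return cls(data["weights"], data["intercept"])


class MeanModel(RegressionModel):
    """Second toy model: always predicts a stored constant."""

    def __init__(self, mean):
        self.mean = float(mean)

    def predict(self, features):
        return self.mean

    def to_json(self):
        return json.dumps({"mean": self.mean})

    @classmethod
    def from_json(cls, payload):
        return cls(json.loads(payload)["mean"])


def check_regression_contract(model, samples):
    """Checks written once against the interface, reusable for any model."""
    before = [model.predict(x) for x in samples]
    # Every prediction is a float.
    assert all(isinstance(p, float) for p in before)
    # predict is stateless: calling it again gives the same answers.
    assert before == [model.predict(x) for x in samples]
    # The model still works after a serialization round trip.
    restored = type(model).from_json(model.to_json())
    assert before == [restored.predict(x) for x in samples]
    return before


samples = [[1.0, 2.0], [0.0, 0.0], [3.5, -1.0]]
for candidate in (LinearModel([0.5, 2.0], 1.0), MeanModel(4.0)):
    print(type(candidate).__name__, check_regression_contract(candidate, samples))
```

A new algorithm would then only need to implement the interface and be
added to the list of candidates; the pipeline code and the checks stay
unchanged, which is the point made in the proposal.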