Mailing-List: contact dev-help@spark.apache.org; run by ezmlm
Precedence: bulk
Received-SPF: pass (nike.apache.org: domain of pahomov.egor@gmail.com
 designates 209.85.192.53 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CABPQxsuJfL9PUYEKsaE=9YyaTfiB5Y5SbR8mAsLnWY1B9YCC+Q@mail.gmail.com>
References: 
 <CAMrx5DwF1c2FDbFcq0OHp2xppXZYtW=JN3GXXyPrZcuq7gTTNw@mail.gmail.com>
	<CAKxiPZO6F06dUmUDFVTRcc6sN+m2VZ8P2SXMgdyvA5eD2XSG4A@mail.gmail.com>
	<CAMrx5DwChP_b=kyMW5xF-45W+-01ndnj1GoGTFs=mEpuQB5ccg@mail.gmail.com>
	<CAMrx5DzjgQ2pQk3=5tQHBT+fAcbQ8kb+OPOfgpDZogRJqTWSRQ@mail.gmail.com>
	<CAPh_B=aPKPtF9vAbuCzUDDYhuQXZmCsBz5XaKmRt7QAP=oBGrQ@mail.gmail.com>
	<CAJgQjQ_bGfoutcosFKceW_FJZwRnRAApDsDyqtgQ5nhwUfAG0w@mail.gmail.com>
	<703329832.21829963.1410549021425.JavaMail.zimbra@redhat.com>
	<CABPQxsuJfL9PUYEKsaE=9YyaTfiB5Y5SbR8mAsLnWY1B9YCC+Q@mail.gmail.com>
Date: Mon, 15 Sep 2014 10:26:05 +0400
Message-ID: 
 <CAMrx5Dy+JFJCbK+yg94wCiH2_LNobkasNuiT97MjimmfGyxa-w@mail.gmail.com>
Subject: Re: Adding abstraction in MLlib
From: Egor Pahomov <pahomov.egor@gmail.com>
To: Patrick Wendell <pwendell@gmail.com>
Cc: Erik Erlandson <eje@redhat.com>, Xiangrui Meng <mengxr@gmail.com>,
 Reynold Xin <rxin@databricks.com>,
	Christoph Sawade <christoph.sawade@googlemail.com>,
	"dev@spark.apache.org" <dev@spark.apache.org>,
 Xiangrui Meng <meng@databricks.com>,
	Joseph Bradley <joseph@databricks.com>
Content-Type: multipart/alternative; boundary=001a11396236a7a02f050314b590

--001a11396236a7a02f050314b590
Content-Type: text/plain; charset=UTF-8

It's good that Databricks is working on this issue! However, the current
process for that work is not very clear to an outsider.

   - The last update on this ticket was August 5. If development has been
   active all this time, I am concerned that, without community feedback
   for so long, it could go in the wrong direction.
   - Even if it ends up as one big patch, introducing the new interfaces to
   the community as soon as possible would let us start working on our
   pipeline code. It would let us write algorithms in the new paradigm
   instead of with no paradigm at all, as before. It would also let us help
   you move old code to the new paradigm.

My main point: shorter iterations with more transparency.

I think it would be a good idea to open a pull request with the code you
have so far, even if it doesn't pass tests, so that we can comment on it
before it is formalized in a design doc.
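
For illustration only, the interface-plus-shared-test idea from the start of
this thread could look roughly like the sketch below. It rests on my own
assumptions: it is written in Python for brevity rather than as a Scala
trait, and RegressionModel, LinearModel, to_json, from_json and
survives_serialization are hypothetical names, not existing MLlib API.

```python
import json
from abc import ABC, abstractmethod
from typing import Iterable, Sequence


class RegressionModel(ABC):
    """Hypothetical shared interface; not an existing MLlib API."""

    @abstractmethod
    def predict(self, features: Sequence[float]) -> float:
        """Return the predicted value for one feature vector."""

    @abstractmethod
    def to_json(self) -> str:
        """Serialize the model so that it can be stored."""

    @classmethod
    @abstractmethod
    def from_json(cls, payload: str) -> "RegressionModel":
        """Rebuild a model from the output of to_json."""


class LinearModel(RegressionModel):
    """Toy implementation, used only to exercise the shared test."""

    def __init__(self, weights: Sequence[float], intercept: float = 0.0):
        self.weights = list(weights)
        self.intercept = intercept

    def predict(self, features: Sequence[float]) -> float:
        if len(features) != len(self.weights):
            raise ValueError("feature vector has the wrong length")
        return self.intercept + sum(w * x for w, x in zip(self.weights, features))

    def to_json(self) -> str:
        return json.dumps({"weights": self.weights, "intercept": self.intercept})

    @classmethod
    def from_json(cls, payload: str) -> "LinearModel":
        data = json.loads(payload)
        return cls(data["weights"], data["intercept"])


def survives_serialization(
    model: RegressionModel, samples: Iterable[Sequence[float]]
) -> bool:
    """Shared test: predictions must be unchanged after a round trip."""
    restored = type(model).from_json(model.to_json())
    return all(restored.predict(x) == model.predict(x) for x in samples)
```

A new algorithm would then only need to implement the interface to pick up
the shared test, e.g. survives_serialization(LinearModel([2.0, -1.0], 0.5),
[[1.0, 3.0]]).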


2014-09-13 0:00 GMT+04:00 Patrick Wendell <pwendell@gmail.com>:

> We typically post design docs on JIRA before major work starts. For
> instance, I'm pretty sure SPARK-1856 will have a design doc posted
> shortly.
>
> On Fri, Sep 12, 2014 at 12:10 PM, Erik Erlandson <eje@redhat.com> wrote:
> >
> > Are interface designs being captured anywhere as documents that the
> > community can follow along with as the proposals evolve?
> >
> > I've worked on other open source projects where design docs were
> > published as "living documents" (e.g. on Google Docs or Etherpad, but
> > the particular mechanism isn't crucial). FWIW, I found that to be a good
> > way to work in a community environment.
> >
> >
> > ----- Original Message -----
> >> Hi Egor,
> >>
> >> Thanks for the feedback! We are aware of some of the issues you
> >> mentioned and there are JIRAs created for them. Specifically, I'm
> >> pushing out the design on pipeline features and algorithm/model
> >> parameters this week. We can move our discussion to
> >> https://issues.apache.org/jira/browse/SPARK-1856 .
> >>
> >> It would be nice to make tests against interfaces. But it definitely
> >> needs more discussion before making PRs. For example, we discussed the
> >> learning interfaces in Christoph's PR
> >> (https://github.com/apache/spark/pull/2137/) but it takes time to
> >> reach a consensus, especially on interfaces. Hopefully all of us could
> >> benefit from the discussion. The best practice is to break down the
> >> proposal into small independent pieces and discuss them on the JIRA
> >> before submitting PRs.
> >>
> >> For performance tests, there is a spark-perf package
> >> (https://github.com/databricks/spark-perf) and we added performance
> >> tests for MLlib in v1.1. But definitely more work needs to be done.
> >>
> >> The dev-list may not be a good place for discussion on the design,
> >> could you create JIRAs for each of the issues you pointed out, and we
> >> track the discussion on JIRA? Thanks!
> >>
> >> Best,
> >> Xiangrui
> >>
> >> On Fri, Sep 12, 2014 at 10:45 AM, Reynold Xin <rxin@databricks.com>
> >> wrote:
> >> > Xiangrui can comment more, but I believe he and Joseph are actually
> >> > working on a standardized interface and pipeline feature for the 1.2
> >> > release.
> >> >
> >> > On Fri, Sep 12, 2014 at 8:20 AM, Egor Pahomov
> >> > <pahomov.egor@gmail.com> wrote:
> >> >
> >> >> Some architectural suggestions on this matter:
> >> >> https://github.com/apache/spark/pull/2371
> >> >>
> >> >> 2014-09-12 16:38 GMT+04:00 Egor Pahomov <pahomov.egor@gmail.com>:
> >> >>
> >> >> > Sorry, I miswrote: I meant the learners part of the framework;
> >> >> > models already exist.
> >> >> >
> >> >> > 2014-09-12 15:53 GMT+04:00 Christoph Sawade <
> >> >> > christoph.sawade@googlemail.com>:
> >> >> >
> >> >> >> I totally agree, and we also discovered some drawbacks with the
> >> >> >> classification model implementations that are based on GLMs:
> >> >> >>
> >> >> >> - There is no distinction between predicting scores, classes, and
> >> >> >> calibrated scores (probabilities). For these models it is common to
> >> >> >> have access to all of them, and the prediction function ``predict``
> >> >> >> should be consistent and stateless. Currently, the score is only
> >> >> >> available after removing the threshold from the model.
> >> >> >> - There is no distinction between multinomial and binomial
> >> >> >> classification. For multinomial problems, it is necessary to handle
> >> >> >> multiple weight vectors and multiple confidences.
> >> >> >> - Models are not serialisable, which makes it hard to use them in
> >> >> >> practice.
> >> >> >>
> >> >> >> I started a pull request [1] some time ago. I would be happy to
> >> >> >> continue the discussion and clarify the interfaces, too!
> >> >> >>
> >> >> >> Cheers, Christoph
> >> >> >>
> >> >> >> [1] https://github.com/apache/spark/pull/2137/
> >> >> >>
> >> >> >> 2014-09-12 11:11 GMT+02:00 Egor Pahomov <pahomov.egor@gmail.com>:
> >> >> >>
> >> >> >>> Here at Yandex, while implementing gradient boosting in Spark and
> >> >> >>> building our ML tool for internal use, we found the following
> >> >> >>> serious problems in MLlib:
> >> >> >>>
> >> >> >>>
> >> >> >>>    - There is no Regression/Classification model abstraction. We
> >> >> >>>    were building abstract data processing pipelines that should
> >> >> >>>    work with any regression, with the exact algorithm specified
> >> >> >>>    outside this code. There is no abstraction that allows me to do
> >> >> >>>    that. *(This is the main reason for all the further problems.)*
> >> >> >>>    - There is no common practice in MLlib for testing algorithms:
> >> >> >>>    every model generates its own random test data. There are no
> >> >> >>>    easily extractable test cases applicable to another algorithm.
> >> >> >>>    There are no benchmarks for comparing algorithms. After
> >> >> >>>    implementing a new algorithm, it's very hard to understand how
> >> >> >>>    it should be tested.
> >> >> >>>    - Lack of serialization testing: MLlib algorithms don't contain
> >> >> >>>    tests that check that a model works after serialization.
> >> >> >>>    - When implementing a new algorithm, it's hard to understand
> >> >> >>>    what API you should create and which interface to implement.
> >> >> >>>
> >> >> >>> Solving all these problems must start with creating common
> >> >> >>> interfaces for typical algorithms/models: regression,
> >> >> >>> classification, clustering, collaborative filtering.
> >> >> >>>
> >> >> >>> All main tests should be written against these interfaces, so when
> >> >> >>> a new algorithm is implemented, all it has to do is pass the
> >> >> >>> already written tests. That would let us keep quality manageable
> >> >> >>> across the whole library.
> >> >> >>>
> >> >> >>> There should be a couple of benchmarks that give a new Spark user
> >> >> >>> a feeling for which algorithm to use.
> >> >> >>>
> >> >> >>> The test set against these abstractions should contain a
> >> >> >>> serialization test. In production, there is usually no need for a
> >> >> >>> model that can't be stored.
> >> >> >>>
> >> >> >>> As the first step of this roadmap, I'd like to create a trait
> >> >> >>> RegressionModel, *ADD* methods to current algorithms to implement
> >> >> >>> this trait, and create some tests against it. I plan to do it next
> >> >> >>> week.
> >> >> >>>
> >> >> >>> The purpose of this letter is to collect any objections to this
> >> >> >>> approach at an early stage: please give any feedback. The second
> >> >> >>> reason is to set a lock on this activity so that we don't do the
> >> >> >>> same thing twice: I'll create a pull request by the end of next
> >> >> >>> week, and we can start any parallel development from there.
> >> >> >>>
> >> >> >>>
> >> >> >>>
> >> >> >>> --
> >> >> >>>
> >> >> >>>
> >> >> >>>
> >> >> >>> *Sincerely yours, Egor Pakhomov, Scala Developer, Yandex*
> >> >> >>>
> >> >> >>
> >> >> >>
> >> >> >
> >> >> >
> >> >> > --
> >> >> >
> >> >> >
> >> >> >
> >> >> > *Sincerely yours, Egor Pakhomov, Scala Developer, Yandex*
> >> >> >
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >>
> >> >>
> >> >>
> >> >> *Sincerely yours, Egor Pakhomov, Scala Developer, Yandex*
> >> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> >> For additional commands, e-mail: dev-help@spark.apache.org
> >>
> >>
> >
> >
>


-- 


*Sincerely yours, Egor Pakhomov, Scala Developer, Yandex*

--001a11396236a7a02f050314b590--