spark-reviews mailing list archives

From sethah <...@git.apache.org>
Subject [GitHub] spark pull request #17094: [SPARK-19762][ML] Hierarchy for consolidating ML ...
Date Tue, 28 Feb 2017 03:36:00 GMT
GitHub user sethah opened a pull request:

    https://github.com/apache/spark/pull/17094

    [SPARK-19762][ML] Hierarchy for consolidating ML aggregator/loss code

    ## What changes were proposed in this pull request?
    
    JIRA: [SPARK-19762](https://issues.apache.org/jira/browse/SPARK-19762)
    
    This patch is a WIP. 
    
    The larger changes in this patch are:
    
    * Adds a `DifferentiableLossAggregator` trait, intended as a common parent trait for all
Spark ML aggregator classes. It factors the common methods `merge`, `gradient`, `loss`, and
`weight` out of the aggregator subclasses.
    * Adds an `RDDLossFunction`, intended to be the only implementation of Breeze's
`DiffFunction` needed in Spark ML, so that all other algorithms can use it. It takes the
aggregator type as a type parameter and maps the aggregator over an RDD. It also takes an
optional regularization loss function for applying the differentiable part of regularization.
    * Factors the regularization out of the data part of the cost function and treats it as
a separate, independent cost function that can be evaluated and added to the data cost
function.
    * Changes `LinearRegression` to use this new hierarchy as a proof of concept.
    * Adds the following new namespaces: `o.a.s.ml.optim.loss` and `o.a.s.ml.optim.aggregator`.
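
As a rough, self-contained illustration of the shape described in the list above (not the code
in this patch), the sketch below uses hypothetical names (`Instance`, `LossAggregatorSketch`,
`LeastSquaresSketch`, `SeqLossFunctionSketch`, `RegularizationSketch`). It assumes a plain
`Seq` split into chunks in place of an RDD and its partitions, and a function returning
`(loss, gradient)` in place of Breeze's `DiffFunction`.

```scala
// Hypothetical, simplified stand-ins for illustration; not the code in this patch.
final case class Instance(label: Double, weight: Double, features: Array[Double])

// Common parent: subclasses implement only `add`; merge/loss/gradient/weight are shared.
trait LossAggregatorSketch[Agg <: LossAggregatorSketch[Agg]] { self: Agg =>
  def dim: Int
  protected var weightSum: Double = 0.0
  protected var lossSum: Double = 0.0
  // Lazy so that it is allocated only after the subclass has defined `dim`.
  protected lazy val gradientSum: Array[Double] = new Array[Double](dim)

  /** Folds one instance into the running sums. */
  def add(instance: Instance): Agg

  /** Combines the sums of another aggregator (e.g. another partition) into this one. */
  def merge(other: Agg): Agg = {
    require(dim == other.dim, "aggregators must have the same dimension")
    if (other.weightSum != 0.0) {
      weightSum += other.weightSum
      lossSum += other.lossSum
      var j = 0
      while (j < dim) { gradientSum(j) += other.gradientSum(j); j += 1 }
    }
    this
  }

  def weight: Double = weightSum
  def loss: Double = { require(weightSum > 0.0, "no instances added"); lossSum / weightSum }
  def gradient: Array[Double] = {
    require(weightSum > 0.0, "no instances added")
    gradientSum.map(_ / weightSum)
  }
}

// Weighted squared error: loss_i = 0.5 * w_i * (x_i . beta - y_i)^2.
class LeastSquaresSketch(coefficients: Array[Double])
    extends LossAggregatorSketch[LeastSquaresSketch] {
  val dim: Int = coefficients.length

  def add(instance: Instance): LeastSquaresSketch = {
    val prediction = instance.features.zip(coefficients).map { case (x, b) => x * b }.sum
    val error = prediction - instance.label
    lossSum += 0.5 * instance.weight * error * error
    var j = 0
    while (j < dim) {
      gradientSum(j) += instance.weight * error * instance.features(j)
      j += 1
    }
    weightSum += instance.weight
    this
  }
}

// Regularization as its own differentiable cost: coefficients => (loss, gradient).
object RegularizationSketch {
  def l2(regParam: Double): Array[Double] => (Double, Array[Double]) =
    coef => (0.5 * regParam * coef.map(b => b * b).sum, coef.map(regParam * _))
}

// Generic loss function: parameterized by the aggregator type, with optional regularization.
class SeqLossFunctionSketch[Agg <: LossAggregatorSketch[Agg]](
    data: Seq[Instance],
    newAggregator: Array[Double] => Agg,
    regularization: Option[Array[Double] => (Double, Array[Double])] = None,
    partitionSize: Int = 2) {

  def calculate(coefficients: Array[Double]): (Double, Array[Double]) = {
    // `add` within each chunk, then `merge` across chunks (chunks mimic partitions).
    val agg = data.grouped(partitionSize)
      .map(part => part.foldLeft(newAggregator(coefficients))(_.add(_)))
      .foldLeft(newAggregator(coefficients))(_ merge _)
    val (regLoss, regGrad) = regularization
      .map(reg => reg(coefficients))
      .getOrElse((0.0, new Array[Double](coefficients.length)))
    val grad = agg.gradient.zip(regGrad).map { case (g, r) => g + r }
    (agg.loss + regLoss, grad)
  }
}
```

In this sketch, a new algorithm would only need to supply its own aggregator subclass with an
`add` method. The loss function and the regularization term are reused unchanged.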
    
    **NOTE: The large majority of the "lines added" and "lines deleted" are simply code moving
around or unit tests.**
    
    BTW, I also converted LinearSVC to this framework as a way to prove that this new hierarchy
is flexible enough for the other algorithms, but I backed those changes out because the PR
is large enough as is. 
    
    ## How was this patch tested?
    Test suites are added for the new components, and some test suites are also added to provide
coverage where there wasn't any before.
    
    * DifferentiableLossAggregatorSuite
    * LeastSquaresAggregatorSuite
    * RDDLossFunctionSuite
    * DifferentiableRegularizationSuite
    
    I would additionally like to run some performance/scale tests with linear regression to
ensure that there are no regressions. This patch is WIP until I can complete the tests. Since
the design will likely have some iteration, I'd like to have it open for review before the
scale tests are done.
    
    ## Follow ups
    
    If this design is accepted, we will convert the other ML algorithms that use this aggregator
pattern to the new hierarchy in follow-up PRs.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sethah/spark ml_aggregators

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17094.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17094
    
----
commit d6fae000d95284598e41d8bf95eb7067d8970e69
Author: sethah <seth.hendrickson16@gmail.com>
Date:   2017-02-27T19:03:03Z

    consolidate ml aggregators

commit 86b56001a82f43fe1342bb1c26c6edcce6523865
Author: sethah <seth.hendrickson16@gmail.com>
Date:   2017-02-27T20:29:14Z

    curried constructors

commit 06e547bdfb38d3b428a4a48c681aea989a11d625
Author: sethah <seth.hendrickson16@gmail.com>
Date:   2017-02-27T21:06:59Z

    self types and docs

commit c930ced63b5c1faebe8063c1bf90a26cf9fae2be
Author: sethah <seth.hendrickson16@gmail.com>
Date:   2017-02-27T22:25:27Z

    aggregator test suite

commit 6a596f23c855b2da0d9ba9133dee2f311dceb615
Author: sethah <seth.hendrickson16@gmail.com>
Date:   2017-02-27T23:03:16Z

    loss function suite

commit 4b36119652173fff30c5869694015e1519753a05
Author: sethah <seth.hendrickson16@gmail.com>
Date:   2017-02-27T23:50:24Z

    ls agg tests

commit ac55f06238cc9043ac2eaf282c3f8513a1a97076
Author: sethah <seth.hendrickson16@gmail.com>
Date:   2017-02-28T00:37:16Z

    all tests passing, still need tests for regularization

commit ab5151ea41cde7d898bd65b998f674da3a5975ea
Author: sethah <seth.hendrickson16@gmail.com>
Date:   2017-02-28T01:07:59Z

    regularization suite

commit 0366a8eefcef39c3251c9a7050944ada03bb4f47
Author: sethah <seth.hendrickson16@gmail.com>
Date:   2017-02-28T01:14:50Z

    backing out svc changes

commit 28b88e48027959e0574c9d13236daff44fcdf650
Author: sethah <seth.hendrickson16@gmail.com>
Date:   2017-02-28T01:50:56Z

    style cleanups and documentation

commit 9a04d0bc51bed29bca28a5e34ebc5b614b6560d2
Author: sethah <seth.hendrickson16@gmail.com>
Date:   2017-02-28T03:15:11Z

    tolerances and imports

----



