spark-reviews mailing list archives

From derrickburns <...@git.apache.org>
Subject [GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Date Thu, 02 Oct 2014 20:15:54 GMT
GitHub user derrickburns opened a pull request:

    https://github.com/apache/spark/pull/2634

    [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-3424] [RESUBMIT] MLLIB K-Means Clusterer

    This commit introduces a general distance function trait, `PointOps`, for the Spark K-Means
clusterer. There are no public API changes, though, as noted under Compatibility below, some behavior differs.
    
    ### Issue - Data Capture Test Fails - NEED HELP
    
    The `org.apache.spark.mllib.clustering.KMeansClusterSuite` test "task size should be small
in both training and prediction" fails, suggesting that the RDD data is being captured in
a closure.  This is quite puzzling: my efforts to solve this problem have failed, and I *need
help* to fix it.
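
    For reference, the usual cause of this test failure is a closure passed to an RDD operation that accidentally references the enclosing object, forcing Spark to serialize the whole object into each task. A minimal sketch of the standard workaround follows; the class and method names are illustrative, not the actual code in this pull request:

```scala
import org.apache.spark.rdd.RDD

object Distances {
  // Squared Euclidean distance to the nearest of a set of centers.
  def nearest(p: Array[Double], cs: Array[Array[Double]]): Double =
    cs.map(c => c.zip(p).map { case (a, b) => (a - b) * (a - b) }.sum).min
}

class Clusterer(val centers: Array[Array[Double]]) {
  // BAD: referencing the field `centers` inside the closure captures `this`,
  // so every task serializes the entire Clusterer and the task size balloons.
  def costCaptured(data: RDD[Array[Double]]): Double =
    data.map(p => Distances.nearest(p, centers)).sum()

  // GOOD: copy the field to a local val first; the closure then captures
  // only the array, keeping the serialized task small.
  def costLocal(data: RDD[Array[Double]]): Double = {
    val localCenters = centers
    data.map(p => Distances.nearest(p, localCenters)).sum()
  }
}
```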
    
    ### Distance Function Trait
    
    The `PointOps` trait defines the distance function, but it is more than a simple distance
function: it also defines the Point and Center types used by the clusterer.  Standard MLLIB
`Vector`s are converted into Points and Centers.  In the `FastEuclideanOps` implementation
of `PointOps`, the Point and Center types include cached vector norms. In other distance
functions, such as the [Kullback-Leibler divergence](http://en.wikipedia.org/wiki/Bregman_divergence),
the Point and Center types include different cached values that speed up the distance calculation,
much as caching vector norms speeds up the Euclidean distance function.  This
addresses SPARK-3219.
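
    As a sketch of the shape such a trait can take (the names and signatures below are illustrative, not the actual definitions in this pull request), the idea is to bundle the distance function together with the point and center representations it operates on:

```scala
import org.apache.spark.mllib.linalg.Vector

// Illustrative sketch only; the actual trait in this pull request may differ.
trait PointOps[P, C] {
  // Convert standard MLLIB vectors into the clusterer's internal types,
  // which may carry precomputed values alongside the raw coordinates.
  def pointFromVector(v: Vector): P
  def centerFromVector(v: Vector): C

  // The distance function itself, operating on the enriched types.
  def distance(p: P, c: C): Double
}

// For the Euclidean case, caching the squared norm with each vector pays off:
// ||p - c||^2 = ||p||^2 - 2 (p . c) + ||c||^2 reuses both cached norms and
// leaves only a dot product to compute per distance evaluation.
case class NormedVec(values: Array[Double], norm2: Double)
```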
    
    ### Refactoring
    
    To understand the original code, I found it useful to refactor the original implementation
into components.  You may find it helpful, when reviewing this pull request, to compare the
new components to their original implementations.  Unfortunately, GitHub
diff does not help very much with this.
    
    This commit splits up the clusterer into a number of components which behave (largely)
like their predecessors. `KMeansParallel` implements the K-Means || initialization algorithm.
 `KMeansRandom` implements the K-Means Random initialization algorithm.  `MultiKMeans` implements
the K-Means algorithm on multiple sets of cluster centers using a given distance function.
 Traits for the initializer, `KMeansInitializer`, and the general K-Means clusterer, `MultiKMeansClusterer`,
are provided to highlight the salient interfaces with the intent that alternative implementations
of these interfaces may be provided in the future.
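
    The two traits roughly take the following shape; the signatures here are simplified and illustrative rather than the exact ones in the diff:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.Vector

// Produces one or more initial sets of cluster centers.
// KMeansParallel (K-Means ||) and KMeansRandom are the two implementations.
trait KMeansInitializer {
  def init(data: RDD[Vector], k: Int, runs: Int): Array[Array[Vector]]
}

// Runs Lloyd's iterations on multiple candidate center sets at once, under a
// given distance function, and returns the cost and centers of the best run.
// MultiKMeans is the provided implementation.
trait MultiKMeansClusterer {
  def cluster(data: RDD[Vector], centers: Array[Array[Vector]]): (Double, Array[Vector])
}
```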
    
    ### Performance
    
    This pull request is not focused on performance. Nevertheless, the performance of the
KMeans++ implementation was *dramatically* improved by not recomputing distances to cluster
centers that were present in previous steps.  This turns a quadratic implementation into a
linear one.
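
    The optimization amounts to maintaining, for each point, a running minimum distance to the centers chosen so far, so that adding a center costs one new distance evaluation per point instead of a pass over every previously chosen center. A hedged sketch of the idea in plain Scala (not the actual implementation):

```scala
// Incrementally update per-point minimum squared distances when a new center
// is chosen, rather than recomputing distances to all prior centers.
def updateMinDistances(points: Array[Array[Double]],
                       minDist: Array[Double],      // running minima so far
                       newCenter: Array[Double]): Unit = {
  var i = 0
  while (i < points.length) {
    val d = points(i).zip(newCenter).map { case (a, b) => (a - b) * (a - b) }.sum
    if (d < minDist(i)) minDist(i) = d  // only the newest center can improve a minimum
    i += 1
  }
}
// Each seeding round now costs O(n) distance evaluations; over k rounds this
// is O(n k), versus the O(n k^2) of recomputing against every prior center.
```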
    
    Second, the KMeans++ implementation uses the general K-Means clusterer in the final step.
 This parallelizes a step that was sequential.
    
    Together, these changes address SPARK-3424.
    
    ### Next Steps
    
    This pull request does not introduce new user-visible changes.  The next step is to make
different distance functions available through a user-visible API.   I will provide other
distance functions after this pull request has been accepted. Then, we can decide on an appropriate
user-level API to access those functions.
    
    ### Compatibility
    
    While there are no user-level API changes, the behavior of the clusterer is *different*
on some tests.  Specifically, the handling of empty clusters has changed: empty clusters
are no longer filled with random points in this implementation.  The former behavior is undesirable
for a number of reasons, not the least of which is that there is no reasonable use for duplicate
cluster centers. To accommodate the change in behavior, the test cases were updated accordingly.
This addresses SPARK-3261.
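
    In other words, when a Lloyd's step leaves a cluster with no assigned points, that center is dropped rather than reseeded with a random point. An illustrative sketch of the new behavior (names and shapes are hypothetical, not the code in this pull request):

```scala
// Given per-center coordinate sums and point counts from one Lloyd's step,
// recompute centers, silently dropping any cluster that received no points.
// The former behavior instead refilled empty clusters with random points,
// which can produce duplicate centers of no modeling value.
def recomputeCenters(sums: Array[Array[Double]],
                     counts: Array[Long]): Array[Array[Double]] =
  sums.zip(counts).collect {
    case (sum, n) if n > 0 => sum.map(_ / n)  // mean of the assigned points
    // empty clusters (n == 0) fall through and are removed
  }
```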
    
    The private K-Means constructor that was used by some Java test code and one example
was replaced with a Scala constructor that is not Java friendly.  Since the constructor was
not user visible, I simply changed the Java test code and the example to use the higher-level
interface.
    
    ### Testing
    
    This code has been tested (albeit while packaged outside of Spark), and its performance
has been measured on data sets of millions of points, each with hundreds of dimensions, and
on tens of thousands of clusters.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/derrickburns/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2634.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2634
    
----
commit fbfdcd8946d8681c062ba08d71eac961db50417d
Author: Derrick Burns <derrickrburns@gmail.com>
Date:   2014-10-02T20:01:18Z

    This commit introduces a general distance function trait, PointOps, for the Spark K-Means
clusterer. There are no public API changes.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

