spark-reviews mailing list archives

From mengxr <>
Subject [GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Date Thu, 08 Jan 2015 23:55:01 GMT
Github user mengxr commented on the pull request:
    @derrickburns I like the improvements implemented in this PR. But as @srowen mentioned,
we have to resolve conflicts with the master branch before we can merge any PR. I compared
the performance of this PR against master on MNIST digits (60000x784, sparse, 10 clusters) locally
and found that master runs 2-3x faster. I suspect the gap is mainly caused by two changes:
    1. In master, we replaced Breeze operations with our own implementation, which is about 2-3x faster.
    1. Running k-means++ distributively adds noticeable overhead when k and the feature dimension are small.
    I think it is still feasible to include these features through separate PRs:
    1. remember previously computed best distances in k-means++ initialization
    1. allow fixing the random seed (addressed in #3610)
    1. variable number of clusters. We should discuss whether we want to return fewer than k
clusters or split the biggest one if there are more than k points.
    1. parallelize k-means++. Whether we should replace the local k-means++ or make it
configurable requires some discussion and a performance comparison.
    1. support Bregman divergences
    Putting all of them together would certainly delay the review process and require resolving
conflicts. I may have some time to prepare PRs for some of the features here, if you don't mind.
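    As an illustration of the first item, here is a minimal local sketch of k-means++ seeding that caches each point's best distance to the centers chosen so far, so adding a center costs one pass over the data instead of recomputing against all centers (the names are hypothetical, not Spark's implementation):

    ```scala
    object KMeansPlusPlusSketch {
      def squaredDist(a: Array[Double], b: Array[Double]): Double = {
        var s = 0.0
        var i = 0
        while (i < a.length) { val d = a(i) - b(i); s += d * d; i += 1 }
        s
      }

      def init(points: Array[Array[Double]], k: Int, rnd: scala.util.Random): Array[Array[Double]] = {
        val centers = new Array[Array[Double]](k)
        centers(0) = points(rnd.nextInt(points.length))
        // cached best squared distance from each point to its closest chosen center
        val best = points.map(p => squaredDist(p, centers(0)))
        for (c <- 1 until k) {
          // sample the next center with probability proportional to the cached distances
          val total = best.sum
          var r = rnd.nextDouble() * total
          var idx = 0
          while (idx < points.length - 1 && r > best(idx)) { r -= best(idx); idx += 1 }
          centers(c) = points(idx)
          // update the cache against the new center only
          var i = 0
          while (i < points.length) {
            best(i) = math.min(best(i), squaredDist(points(i), centers(c)))
            i += 1
          }
        }
        centers
      }
    }
    ```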
    For Bregman divergences, I'm thinking we can alter the formulation to support sparse vectors:
    d(x, y) = f(x) - f(y) - <x - y, g(y)> = f(x) - (f(y) - <y, g(y)>) - <x, g(y)>
    where `f(x)`, `g(y)`, and `f(y) - <y, g(y)>` could be pre-computed and cached, and
`<x, g(y)>` can take advantage of sparse `x`. But I'm not sure whether this formulation
is actually useful for any Bregman divergence other than the squared distance and the Mahalanobis
distance. For the KL-divergence and the generalized I-divergence, the domain is R^d_+ and hence the
points cannot be sparse.
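    To make the caching idea concrete, here is a minimal sketch (the names and the choice of `f` are illustrative, not Spark code). For a convex `f` with gradient `g`, the divergence expands to `f(x) - (f(y) - <y, g(y)>) - <x, g(y)>`, so the two center-side terms can be precomputed per center and only `<x, g(y)>` depends on `x`. Taking `f(x) = <x, x>` and `g(y) = 2y` recovers the squared Euclidean distance:

    ```scala
    // Precomputed center-side terms: g(y) and the scalar f(y) - <y, g(y)>.
    case class CachedCenter(gy: Array[Double], offset: Double)

    object BregmanSketch {
      private def dot(a: Array[Double], b: Array[Double]): Double = {
        var s = 0.0; var i = 0
        while (i < a.length) { s += a(i) * b(i); i += 1 }
        s
      }

      def f(x: Array[Double]): Double = dot(x, x)
      def g(y: Array[Double]): Array[Double] = y.map(_ * 2)

      // Cache g(y) and f(y) - <y, g(y)> once per center.
      def cache(y: Array[Double]): CachedCenter = {
        val gy = g(y)
        CachedCenter(gy, f(y) - dot(y, gy))
      }

      // Only <x, g(y)> depends on x; with a sparse x this dot product
      // would iterate over x's nonzero entries only.
      def divergence(x: Array[Double], c: CachedCenter): Double =
        f(x) - c.offset - dot(x, c.gy)
    }
    ```

    With this choice of `f`, `divergence` agrees with the ordinary squared distance, which is what makes the cached formulation attractive for sparse data.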
    Besides those comments, I'm going to make some minor comments inline.
