spark-dev mailing list archives

From mengxr <>
Subject [GitHub] incubator-spark pull request: SPARK-1122: allCollect functions for...
Date Mon, 24 Feb 2014 00:19:44 GMT
Github user mengxr commented on the pull request:
    @markhamstra @pwendell As for use cases, the allCollect operation may be useful in
grid search for a good set of training parameters in machine learning problems. For example,
if the dataset is only 500MB but training takes half an hour, and we have to try
100 different combinations of training parameters (e.g., rank, regularization constants, and
termination tolerance), the wall-clock time can be reduced by distributing the dataset to
multiple nodes and training in parallel. Another use case is the replicated join, though locality
issues need to be addressed. I agree with you that the current implementation is inefficient:
it puts a heavy load on the driver.
    @coderxiang, could you try to improve the implementation?
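
To make the grid-search use case concrete, here is a minimal sketch in plain Python rather than Spark. It only simulates the idea: `all_collect`, `train`, the toy dataset, and the parameter grid are all hypothetical names and values chosen for illustration, not the API or implementation from the pull request. Every simulated worker receives a full copy of the data and evaluates its own slice of the parameter grid.

```python
def all_collect(partitions):
    """Hypothetical stand-in for allCollect: every partition ends up with all rows.

    The gather step below runs in one place, which mirrors the driver-side
    load mentioned in the comment above.
    """
    full = [row for part in partitions for row in part]
    return [list(full) for _ in partitions]


def train(data, reg):
    """Toy training run (assumed for illustration): fit y = w * x with an
    L2 penalty and return (reg, mean squared error)."""
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, _ in data)
    w = sxy / (sxx + reg)
    loss = sum((y - w * x) ** 2 for x, y in data) / len(data)
    return (reg, loss)


# Toy dataset, initially split across two simulated workers.
partitions = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
replicas = all_collect(partitions)

# Hypothetical grid of regularization constants.
grid = [0.0, 0.1, 1.0, 10.0]

# Each worker takes every n-th grid point and trains on its full local copy.
# On a real cluster these loops would run in parallel, one per node.
n_workers = len(replicas)
results = []
for worker_id, local_data in enumerate(replicas):
    for reg in grid[worker_id::n_workers]:
        results.append(train(local_data, reg))

# Pick the parameter setting with the lowest loss.
best = min(results, key=lambda r: r[1])
print(best)
```

Since each worker holds the whole dataset, the training runs are independent and the wall-clock time scales with the number of grid points per worker rather than with the full grid.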

