spark-dev mailing list archives

From markhamstra <...@git.apache.org>
Subject [GitHub] incubator-spark pull request: SPARK-1122: allCollect functions for...
Date Sun, 23 Feb 2014 07:40:28 GMT
Github user markhamstra commented on the pull request:

    https://github.com/apache/incubator-spark/pull/635#issuecomment-35825888
  
    Huh?  I don't get the point of these at all.
    
    At first glance, allCollect looks like a really bad idea.  Collecting the entire contents
of an RDD to the driver process only to immediately turn around and push all of that data
(or in this case, multiple copies of the data!) back across the network is an anti-pattern
and generally a very poor design choice that cannot scale to large datasets -- if you can
handle all of the data within the driver process, then why are you using a distributed,
big-data framework in the first place?
    
    allCollectBroadcast makes even less sense to me.  Some workflows do demand collecting
a relatively small amount of data to the driver and then broadcasting a small amount back
to the workers for use in further computations, but why would I then want to go through the
extra step of pushing the broadcast values into a strange-looking RDD instead of just using
the broadcast variable directly?
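    For reference, a minimal sketch of the idiomatic collect-then-broadcast workflow
described above, using the broadcast variable directly inside tasks with no extra
RDD-wrapping step.  This assumes a live `SparkContext` named `sc` and Spark on the
classpath, so it is illustrative rather than standalone-runnable; the data and the
`lookupJoin` name are made up for the example.

    ```scala
    import org.apache.spark.SparkContext

    def lookupJoin(sc: SparkContext): Unit = {
      val small = sc.parallelize(Seq(("a", 1), ("b", 2)))
      val large = sc.parallelize(Seq("a", "b", "a"))

      // Collect the (small) lookup RDD to the driver...
      val table: Map[String, Int] = small.collect().toMap

      // ...broadcast it once so every executor gets a single read-only copy...
      val bcast = sc.broadcast(table)

      // ...and read the broadcast variable directly inside tasks -- no need
      // to push the broadcast values back out as a strange-looking RDD.
      val joined = large.map(k => (k, bcast.value.getOrElse(k, 0)))
      joined.collect().foreach(println)
    }
    ```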
    
    It's going to take a lot of persuading to convince me that either of these is something
we want to promote and support in the 1.0 API.  That doesn't mean that I'm not listening,
but I am far from convinced at this point.


