spark-reviews mailing list archives

From nchammas <...@git.apache.org>
Subject [GitHub] spark pull request: [SPARK-3398] [EC2] Have spark-ec2 intelligentl...
Date Sat, 13 Sep 2014 05:45:30 GMT
Github user nchammas commented on the pull request:

    https://github.com/apache/spark/pull/2339#issuecomment-55482806
  
    > Do you have any idea how long it takes to fork the sub-process and SSH into the machine?
    
    Ah, this is a valid concern. I've tested this by launching 50-node clusters, but not
with, say, a 500-node cluster.
    
    `all()` is [short-circuit evaluated](http://bugs.python.org/issue17255), so [this line
of code](https://github.com/apache/spark/pull/2339/files#diff-ada66bbeb2f1327b508232ef6c3805a5R637)
stops at the first node that fails the SSH check. It therefore forks at most one more process
than the number of nodes that have SSH available. So in your example, if I'm launching a
300-node cluster and only 10 of the nodes have SSH available when I test, I'll fork only 11
processes, assuming I'm lucky enough to hit those 10 nodes first.
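
A minimal sketch of the short-circuit behavior described above, assuming the check is passed to `all()` as a generator expression. `count_checks` and `is_ssh_available` are hypothetical names for illustration; the stand-in records each call instead of forking `ssh`.

```python
def count_checks(ssh_ready_flags):
    """Return (all_ready, number_of_checks_made) for a list of per-node flags."""
    checks = []

    def is_ssh_available(ready):
        # Hypothetical stand-in for forking `ssh` against one node.
        checks.append(ready)
        return ready

    # Generator expression: all() stops at the first False it sees.
    all_ready = all(is_ssh_available(flag) for flag in ssh_ready_flags)
    return all_ready, len(checks)


# 300 nodes, the first 10 of which already accept SSH.
flags = [True] * 10 + [False] * 290
print(count_checks(flags))  # (False, 11)
```

Note that a list comprehension inside `all([...])` would build the whole list first, so every node would be checked before `all()` ever ran.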
    
    To be extra safe, I can rewrite this `all()` statement as an explicit loop since the short-circuiting
behavior is not guaranteed on Python 2.6.
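
One way the explicit loop could look, as a sketch only. `all_nodes_ready`, `fake_check`, and the host names are hypothetical and not taken from the pull request; the stub check records which hosts were probed so the early exit is visible.

```python
def all_nodes_ready(hosts, is_ssh_available):
    """Return True only if every host passes the check; stop at the first failure."""
    for host in hosts:
        if not is_ssh_available(host):
            return False  # Remaining hosts are never checked.
    return True


# Demo with a stub check that records which hosts were probed.
probed = []
ready_hosts = {"node-1", "node-2"}


def fake_check(host):
    probed.append(host)
    return host in ready_hosts


result = all_nodes_ready(["node-1", "node-2", "node-3", "node-4"], fake_check)
print(result, probed)  # False ['node-1', 'node-2', 'node-3']
```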
    
    In addition to that, I can implement a simple, linear backoff on the SSH testing. For
example, test SSH every `3 * num_attempts` seconds.
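
A rough sketch of that linear backoff, assuming the delay is applied after each failed attempt. `wait_for_ssh`, `check`, `max_attempts`, and `base_delay` are hypothetical names; the `sleep` parameter is injectable so the demo does not actually block.

```python
import time


def wait_for_ssh(check, max_attempts=10, base_delay=3, sleep=time.sleep):
    """Poll check() with linear backoff; return the attempt that succeeded, or None."""
    for num_attempts in range(1, max_attempts + 1):
        if check():
            return num_attempts
        if num_attempts < max_attempts:
            # Linear backoff: 3s, 6s, 9s, ... between attempts.
            sleep(base_delay * num_attempts)
    return None


# Demo with a fake sleep so nothing actually blocks.
delays = []
results = iter([False, False, False, True])
attempt = wait_for_ssh(lambda: next(results), sleep=delays.append)
print(attempt, delays)  # 4 [3, 6, 9]
```

Here `check` would be whatever tests SSH across the cluster, so each retry waits a little longer than the last without the delay growing as fast as it would under exponential backoff.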
    
    How does that sound? Hopefully not too complex.
    
    > And I'm not sure whether it's too big of a deal.
    
    This is definitely a convenience feature. But I can share from my own experience of regularly
spinning up 20-50 node clusters with `spark-ec2` that I often find myself restarting the launch
with `--resume` because SSH took too long to come online, or I find myself waiting impatiently
because I think I set `--wait` to too high a value. [Others](http://mail-archives.apache.org/mod_mbox/spark-user/201407.mbox/%3CCANk3DLkzLt2WtUGo6OaVPw2CkGfBHkBMDjxxTr7_cCVhDB8Esg@mail.gmail.com%3E)
[have](http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3CCAPtvcLjVPMrRuCH2+_xkRSxp1-=u-OXp+c32sJ3RXxXVZw6OMA@mail.gmail.com%3E)
posted to the user list in confusion, thinking that something is broken when they simply
hadn't set `--wait` high enough.
    
    It would be nice if `spark-ec2` just took care of this detail for the user.

