spark-reviews mailing list archives

From aarondav <...@git.apache.org>
Subject [GitHub] spark pull request: [SPARK-3923] Decrease Akka heartbeat interval ...
Date Mon, 13 Oct 2014 18:21:03 GMT
GitHub user aarondav opened a pull request:

    https://github.com/apache/spark/pull/2784

    [SPARK-3923] Decrease Akka heartbeat interval below heartbeat pause

    Something about the 2.3.4 upgrade seems to have made the issue manifest where all the
services disconnect from each other after exactly 1000 seconds (which is the heartbeat interval).
[This post](https://groups.google.com/forum/#!topic/akka-user/X3xzpTCbEFs) suggests that the heartbeat
interval should be less than the heartbeat pause, and decreasing the interval from 1000s to below
the 600s of the heartbeat pause seems to have rectified the issue. My current cluster has
now exceeded 1400s of uptime without failure!
    
    I do not know why this fixed it, because the threshold we have set for the failure detector
is the exponent of a timeout, and 300 is extremely large. Perhaps the default failure detector
changed in 2.3.4 and now ignores threshold.
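
The constraint described above (heartbeat interval strictly below heartbeat pause) can be sketched as a small sanity check. This is only an illustration, not code from the pull request: the helper `interval_below_pause` is hypothetical, the property names `spark.akka.heartbeat.interval` and `spark.akka.heartbeat.pauses` are assumed to be the Spark settings behind the two values, and the replacement interval of 100s is an arbitrary example, since the description only says "below the 600s".

```python
# Hypothetical helper for illustration; not part of Spark or Akka.
# Property names are assumed to be the Spark settings for these values.
SETTINGS_BEFORE = {
    "spark.akka.heartbeat.interval": 1000,  # seconds, as cited in the description
    "spark.akka.heartbeat.pauses": 600,     # seconds, as cited in the description
}


def interval_below_pause(settings):
    """Return True when the heartbeat interval is strictly below the heartbeat pause."""
    interval = settings["spark.akka.heartbeat.interval"]
    pause = settings["spark.akka.heartbeat.pauses"]
    return interval < pause


# 100 is an arbitrary illustrative value; the description only says "below the 600s".
SETTINGS_AFTER = {**SETTINGS_BEFORE, "spark.akka.heartbeat.interval": 100}

if __name__ == "__main__":
    print("before:", interval_below_pause(SETTINGS_BEFORE))
    print("after: ", interval_below_pause(SETTINGS_AFTER))
```

A check like this only encodes the rule of thumb from the linked post; it says nothing about why the failure detector behaves differently after the 2.3.4 upgrade.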

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/aarondav/spark fix-timeout

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2784.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2784
    
----
commit 9cb03722d689de4da6f46e45609b1e1c6d40d130
Author: Aaron Davidson <aaron@databricks.com>
Date:   2014-10-13T18:14:03Z

    [SPARK-3923] Decrease Akka heartbeat interval below heartbeat pause
    
    Something about the 2.3.4 upgrade seems to have made the issue manifest where
    all the services disconnect from each other after exactly 1000 seconds (which
    is the heartbeat interval). [This post](https://groups.google.com/forum/#!topic/akka-user/X3xzpTCbEFs)
    suggests that the heartbeat interval should be less than the heartbeat pause, and decreasing
    the interval from 1000s to below the 600s of the heartbeat pause seems to have
    rectified the issue. My current cluster has now exceeded 1400s of uptime without
    failure!
    
    I do not know why this fixed it, because the threshold we have set for the
    failure detector is the exponent of a timeout, and 300 is extremely large.
    Perhaps the default failure detector changed in 2.3.4 and now ignores
    threshold.

----



