spark-reviews mailing list archives

From GitBox <>
Subject [GitHub] [spark] vanzin opened a new pull request #26630: [SPARK-29965][core] Ensure that killed executors don't re-register with driver.
Date Thu, 21 Nov 2019 21:18:35 GMT
   There are 3 different issues that cause the same underlying problem: an executor
   that the driver has killed during downscaling registers back with the block
   manager in the driver, and the block manager from that point on keeps trying
   to contact the dead executor.
   The first is that the heartbeat receiver was asking unknown executors to
   re-register when receiving a heartbeat. That code path is really only hit
   when the executor dies because the driver killed it, so there's no reason
   to ask it to re-register.
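A minimal sketch of that first fix (hypothetical names, not Spark's actual classes): a heartbeat handler that simply ignores heartbeats from executors it no longer knows about, instead of asking them to re-register.

```java
import java.util.HashSet;
import java.util.Set;

public class HeartbeatSketch {
    public enum Response { ACK, IGNORED }

    public static class Receiver {
        private final Set<String> known = new HashSet<>();

        public void addExecutor(String id)    { known.add(id); }
        public void removeExecutor(String id) { known.remove(id); }

        // The old behavior would ask an unknown executor to re-register
        // here; the fixed behavior just drops the heartbeat.
        public Response heartbeat(String id) {
            return known.contains(id) ? Response.ACK : Response.IGNORED;
        }
    }

    public static void main(String[] args) {
        Receiver r = new Receiver();
        r.addExecutor("exec-1");
        System.out.println(r.heartbeat("exec-1")); // ACK
        r.removeExecutor("exec-1");
        System.out.println(r.heartbeat("exec-1")); // IGNORED
    }
}
```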
   The second is a race between the heartbeat receiver and the DAG scheduler.
   Both receive notifications of an executor's addition and removal
   asynchronously (the first via the listener bus *and* an async local RPC,
   the second via its own separate internal message queue). This led to
   situations where they disagreed about which executors were really alive;
   the change makes it so the heartbeat receiver is updated first, and only
   once that's done does the DAG scheduler update itself. This ensures the
   heartbeat receiver knows which executors not to ask to re-register.
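A sketch of that ordering fix (again with hypothetical names): executor lifecycle events update the heartbeat receiver synchronously before the DAG scheduler sees them, rather than flowing through two independent asynchronous paths.

```java
import java.util.ArrayList;
import java.util.List;

public class OrderingSketch {
    static final List<String> log = new ArrayList<>();

    static class HeartbeatReceiver {
        void executorRemoved(String id) { log.add("heartbeat-removed:" + id); }
    }

    static class DagScheduler {
        void executorLost(String id) { log.add("dag-lost:" + id); }
    }

    // Fixed flow: the heartbeat receiver is notified first, and the DAG
    // scheduler only afterwards, so the two components never disagree
    // about which executors are alive.
    static void onExecutorRemoved(HeartbeatReceiver hb, DagScheduler dag, String id) {
        hb.executorRemoved(id);
        dag.executorLost(id);
    }

    public static void main(String[] args) {
        onExecutorRemoved(new HeartbeatReceiver(), new DagScheduler(), "exec-1");
        System.out.println(log); // [heartbeat-removed:exec-1, dag-lost:exec-1]
    }
}
```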
   The third is that the block manager couldn't differentiate between an
   unknown executor (like one that's been removed) and an executor that needs
   to re-register (like one the scheduler decided to unregister because of
   too many fetch failures). The change adds code in the block manager master
   to track which executors have been removed, so that instead of asking them
   to re-register, it just ignores them.
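A sketch of that third fix (hypothetical names): the block manager master tracks the executors it has removed, so a killed executor's heartbeats are simply ignored, while a genuinely unknown executor can still be asked to re-register.

```java
import java.util.HashSet;
import java.util.Set;

public class BlockManagerSketch {
    public enum Reply { REGISTERED, REREGISTER, IGNORED_DEAD }

    public static class Master {
        private final Set<String> alive = new HashSet<>();
        private final Set<String> removed = new HashSet<>();

        public void register(String id) { alive.add(id); removed.remove(id); }
        public void remove(String id)   { alive.remove(id); removed.add(id); }

        public Reply onHeartbeat(String id) {
            if (alive.contains(id))   return Reply.REGISTERED;   // healthy executor
            if (removed.contains(id)) return Reply.IGNORED_DEAD; // killed: don't ask it back
            return Reply.REREGISTER;                             // unknown: ask to re-register
        }
    }

    public static void main(String[] args) {
        Master m = new Master();
        m.register("exec-1");
        System.out.println(m.onHeartbeat("exec-1")); // REGISTERED
        m.remove("exec-1");
        System.out.println(m.onHeartbeat("exec-1")); // IGNORED_DEAD
        System.out.println(m.onHeartbeat("exec-2")); // REREGISTER
    }
}
```

The extra `removed` set is what lets the master distinguish "dead by design" from "needs recovery", which the original single-set bookkeeping could not express.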
   While there, I also simplified the executor shutdown path a bit, since it
   was doing some unnecessary work.
   Tested with existing unit tests, and by repeatedly running workloads on k8s
   with dynamic allocation; previously I'd hit these different issues fairly
   often, and with the fixes I'm no longer able to reproduce them.

