spark-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From GitBox <>
Subject [GitHub] [spark] tgravescs commented on issue #26696: [WIP][SPARK-18886][CORE] Only reset scheduling delay timer if allocated slots are fully utilized
Date Tue, 03 Dec 2019 15:25:34 GMT
tgravescs commented on issue #26696: [WIP][SPARK-18886][CORE] Only reset scheduling delay timer
if allocated slots are fully utilized
   Can you please expand on the description?  It sounds like you are saying you fill in all
available slots no matter what locality and then reset the timer?  Which would essentially
bypass the locality wait on all tasks in the first "round" of scheduling, which is not what
we want.
   > , if its task duration is shorter than the locality wait time (3 seconds by default).
   It doesn't have to be strictly shorter then wait time as long the harmonics of when tasks
finish and get started is shorter then that. I've seen that happen before. 
   > A simple solution is: we never reset the timer. When a stage has been waiting long
enough for locality, this stage should not wait for locality anymore. However, this may hurt
performance if the last task is scheduled to a non-preferred location, and a preferred location
becomes available right after this task gets scheduled, 
   Your always going to have a race condition where if you wanted a bit longer you would have
got locality, there is no way around that.  I don't think just waiting a flat timeout makes
sense.  For instance, if I have 10,000 tasks you are saying you only wait 3 seconds (by default)
and then all of them can be scheduled.  If you only have 10 executors (with 1 core), only
10 can run in parallel that doesn't even try to get locality on the 9,990 other tasks.
   I've been thinking about options here and the one that seems the most straight forward
to me is making it delay scheduling on a "slot".  Here a "slot" is a place you can put a task.
 For example if I have 1 executor with 10 cores (default 1 cpu per task) then I have 10 slots.
  When a slot becomes available you set a timer to be whatever the currently locality level
is for that slot.  The reason to use slots and not entire executors is I could schedule 1
task on an executor with 10 slots and the other 9 would be wasted if you set the timer per
   When the timer goes off for that slot, you check to see if anything can be scheduled on
it at that locality, if not you set the timer again to fall back to the next locality level
(node -> rack).  
   This way you aren't wasting available resources.   If you really want your tasks to wait
for locality just set the timeout on those higher.
   Another option is per task, but I think that involves more tracking logic and will use
more memory.
   I think the per slot logic would work fine with the job scheduler scenario with the Fair
scheduler as well.  
   The way spark does it now just seems broken to me, even in the job scheduler case. At some
point you just want to run, if you don't again you can simply increase the timeout.  I'm going
to go re-read the jira's to make sure I'm not missing anything though.  Note if someone is
really worried about getting rid of existing logic we could keep it and have a config, but
personally think we shouldn't.

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:

With regards,
Apache Git Services

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message