flink-issues mailing list archives

From "JIN SUN (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (FLINK-10644) Batch Job: Speculative execution
Date Wed, 07 Nov 2018 00:42:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-10644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

JIN SUN updated FLINK-10644:
----------------------------
    Description: 
Stragglers (outliers) are tasks that run slower than most of the other tasks in a batch job.
They increase job latency, because a straggler usually sits on the critical path of the job
and becomes the bottleneck.

Tasks may be slow for various reasons, including hardware degradation, software misconfiguration,
or noisy neighbors. It is hard for the JM to predict the runtime.

To reduce the impact of stragglers, other systems such as Hadoop/Tez and Spark have *_speculative
execution_*. Speculative execution is a health-check procedure that looks for tasks to
speculate, i.e. tasks in an ExecutionJobVertex that run slower than the median of all successfully
completed tasks in that EJV. Such slow tasks are re-submitted to another TM. The slow task is
not stopped; a new copy runs in parallel, and once one of the copies completes, the others
are killed.
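
The median check described above could be sketched roughly as follows. This is only an
illustration, not Flink code: the class SpeculationSketch, its methods, and the
slowdownFactor parameter are hypothetical names, and the description itself only says
"slower than the median" without naming a multiplier.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; none of these names exist in Flink.
final class SpeculationSketch {

    /** Median runtime (ms) of the tasks that already completed successfully. */
    static double median(long[] finishedMillis) {
        long[] sorted = finishedMillis.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return n % 2 == 1
                ? sorted[n / 2]
                : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    /**
     * Returns the indices of still-running tasks whose elapsed time exceeds
     * median * slowdownFactor. The factor is an assumed tuning knob; a value
     * of 1.0 corresponds to plain "slower than the median".
     */
    static List<Integer> findStragglers(
            long[] finishedMillis, long[] runningElapsedMillis, double slowdownFactor) {
        List<Integer> stragglers = new ArrayList<>();
        if (finishedMillis.length == 0) {
            return stragglers; // no completed task yet, so there is no baseline
        }
        double threshold = median(finishedMillis) * slowdownFactor;
        for (int i = 0; i < runningElapsedMillis.length; i++) {
            if (runningElapsedMillis[i] > threshold) {
                stragglers.add(i); // candidate for a speculative copy on another TM
            }
        }
        return stragglers;
    }
}
```

For example, with completed runtimes of 100, 110, 120 and 130 ms, running tasks at 90, 150
and 400 ms, and a factor of 1.5, the threshold is 172.5 ms and only the third running task
(index 2) would be picked for a speculative copy.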

This JIRA is an umbrella for applying this idea in Flink. Details will be appended later.

  was:
Stragglers (outliers) are tasks that run slower than most of the other tasks in a batch job.
They increase job latency, because a straggler usually sits on the critical path of the job
and becomes the bottleneck.

Tasks may be slow for various reasons, including hardware degradation, software misconfiguration,
or noisy neighbors. It is hard for the JM to predict the runtime.

To reduce the impact of stragglers, other systems such as Hadoop/Tez and Spark have *_speculative
execution_*. Speculative execution is a health-check procedure that looks for tasks to
speculate, i.e. tasks in an ExecutionJobVertex that run slower than the median of all successfully
completed tasks in that EJV. Such slow tasks are re-submitted to another TM. The slow task is
not stopped; a new copy runs in parallel, and once one of the copies completes, the others
are killed.

This JIRA is an umbrella for applying this idea in Flink. Details will be appended later.

 

The contributed document is here: [https://docs.google.com/document/d/1X_Pfo4WcO-TEZmmVTTYNn44LQg5gnFeeaeqM7ZNLQ7M/edit]


> Batch Job: Speculative execution
> --------------------------------
>
>                 Key: FLINK-10644
>                 URL: https://issues.apache.org/jira/browse/FLINK-10644
>             Project: Flink
>          Issue Type: New Feature
>          Components: JobManager
>            Reporter: JIN SUN
>            Assignee: JIN SUN
>            Priority: Major
>             Fix For: 1.8.0
>
>
> Stragglers (outliers) are tasks that run slower than most of the other tasks in a batch job.
They increase job latency, because a straggler usually sits on the critical path of the job
and becomes the bottleneck.
> Tasks may be slow for various reasons, including hardware degradation, software misconfiguration,
or noisy neighbors. It is hard for the JM to predict the runtime.
> To reduce the impact of stragglers, other systems such as Hadoop/Tez and Spark have *_speculative
execution_*. Speculative execution is a health-check procedure that looks for tasks to
speculate, i.e. tasks in an ExecutionJobVertex that run slower than the median of all successfully
completed tasks in that EJV. Such slow tasks are re-submitted to another TM. The slow task is
not stopped; a new copy runs in parallel, and once one of the copies completes, the others
are killed.
> This JIRA is an umbrella for applying this idea in Flink. Details will be appended
later.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
