flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-5830) OutOfMemoryError during notify final state in TaskExecutor may cause job stuck
Date Wed, 22 Feb 2017 03:07:44 GMT

    [ https://issues.apache.org/jira/browse/FLINK-5830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15877368#comment-15877368
] 

ASF GitHub Bot commented on FLINK-5830:
---------------------------------------

Github user zhijiangW commented on the issue:

    https://github.com/apache/flink/pull/3360
  
    Hi @tillrohrmann , thank you for the review and the constructive suggestions!
    
    Let me first explain the root cause of this issue:
    
    On the JobMaster side, it sends the cancel RPC message and receives the acknowledgement from the TaskExecutor; the execution state then transitions to **CANCELING**.
    
    On the TaskExecutor side, it notifies the JobMaster of the final state before the task exits. **notifyFinalState** can be divided into two steps:
    
    - The Akka actor executes the **RunAsync** message; this is a *tell* action, and it triggers **unregisterTaskAndNotifyFinalState**.
    - During **unregisterTaskAndNotifyFinalState**, the **updateTaskExecutionState** RPC message is sent; this is an *ask* action, so this mechanism can avoid lost messages.
    
    The problem is that an OutOfMemoryError may occur before **updateTaskExecutionState** is triggered. This error is caught by **AkkaRpcActor**, which only logs it and does nothing else, so the rest of the handler is interrupted and **updateTaskExecutionState** is never executed.
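
    The lossy path can be sketched as follows (hypothetical names, not Flink's actual code): the handler body may fail before the *ask* is sent, and a catch-all that only logs the error silently drops the notification.

```java
import java.util.ArrayList;
import java.util.List;

public class SwallowedErrorSketch {

    // Handler body: an allocation may fail *before* the ask is sent.
    static void unregisterTaskAndNotifyFinalState(boolean oom, List<String> log) {
        if (oom) {
            throw new OutOfMemoryError("simulated allocation failure");
        }
        log.add("updateTaskExecutionState sent");
    }

    // RunAsync-style execution: catch everything, log it, and carry on.
    static List<String> runAsync(boolean oom) {
        List<String> log = new ArrayList<>();
        try {
            unregisterTaskAndNotifyFinalState(oom, log);
        } catch (Throwable t) {
            // The final-state notification is silently lost here.
            log.add("logged and dropped: " + t.getMessage());
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(runAsync(true));   // failure path: final state never sent
        System.out.println(runAsync(false));  // normal path
    }
}
```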
    
    For this key interaction between the TaskExecutor and the JobMaster, lost messages should not be tolerated, and I agree with your suggestions above. There are two possible ideas for this improvement:
    
    - Enhance the robustness of **notifyFinalState**. Rethrowing the OOM, as done currently, is an easy option, but it causes the TaskExecutor to exit, so there should be other ways to reduce that cost.
    - After the JobMaster receives the cancel acknowledgement, trigger a timeout to check the execution's final state. If the execution has not entered a final state within the timeout, the JobMaster can send another message to the TaskExecutor to confirm the status.
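
    The second idea could be sketched like this (assumed names and timeout value, purely illustrative): after the cancel acknowledgement, the JobMaster schedules a check, and if no final state has arrived by the deadline, it asks the TaskExecutor for the status again.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class FinalStateWatchdogSketch {

    // After the cancel ack, schedule a check; if the final state has not
    // arrived by the deadline, re-request the status from the TaskExecutor.
    static String watchFinalState(boolean finalStateArrived) {
        AtomicBoolean done = new AtomicBoolean(finalStateArrived);
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        try {
            ScheduledFuture<String> check = timer.schedule(
                (Callable<String>) () -> done.get()
                    ? "final state received"
                    : "re-request status from TaskExecutor",
                20, TimeUnit.MILLISECONDS);
            return check.get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            timer.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(watchFinalState(false)); // lost-notification case
        System.out.println(watchFinalState(true));  // normal case
    }
}
```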
    
    I prefer the first way, since it keeps the fix on one side and avoids complex interaction between the TaskExecutor and the JobMaster.
    
    Looking forward to your further suggestions or ideas.


> OutOfMemoryError during notify final state in TaskExecutor may cause job stuck
> ------------------------------------------------------------------------------
>
>                 Key: FLINK-5830
>                 URL: https://issues.apache.org/jira/browse/FLINK-5830
>             Project: Flink
>          Issue Type: Bug
>            Reporter: zhijiang
>            Assignee: zhijiang
>
> The scenario is like this:
> {{JobMaster}} tries to cancel all the executions when processing a failed execution, and the task executor has already acknowledged the cancel RPC message.
> When the {{TaskExecutor}} notifies the final state, an OOM occurs in {{AkkaRpcActor}}, and the error is caught and merely logged. The final state is never sent.
> The {{JobMaster}} therefore cannot receive the final state and trigger the restart strategy.
> One solution is to catch the {{OutOfMemoryError}} and rethrow it, which shuts down the {{ActorSystem}} and causes the {{TaskExecutor}} to exit. The {{JobMaster}} is then notified of the {{TaskExecutor}} failure and fails all its tasks, successfully triggering a restart.
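
A minimal sketch of that solution (illustrative names, not Flink's actual AkkaRpcActor code): ordinary exceptions are still logged and swallowed, but an OutOfMemoryError is rethrown so it can propagate to fatal error handling.

```java
public class RethrowFatalSketch {

    // Proposed handling: log ordinary exceptions as before, but rethrow an
    // OutOfMemoryError so it propagates instead of being swallowed.
    static String handle(Throwable failure) {
        try {
            throw failure;
        } catch (OutOfMemoryError oom) {
            throw oom; // fatal: do not swallow
        } catch (Throwable t) {
            return "logged: " + t.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(handle(new RuntimeException("transient failure")));
        try {
            handle(new OutOfMemoryError("no heap left"));
        } catch (OutOfMemoryError e) {
            // In the real system this would end in ActorSystem shutdown and
            // TaskExecutor exit, letting the JobMaster observe the failure.
            System.out.println("propagated: " + e.getMessage());
        }
    }
}
```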



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
