flink-issues mailing list archives

From "Ufuk Celebi (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-10193) Default RPC timeout is used when triggering savepoint via JobMasterGateway
Date Wed, 22 Aug 2018 08:04:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-10193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16588512#comment-16588512 ]

Ufuk Celebi commented on FLINK-10193:
-------------------------------------

[~gjy] I've changed the ticket type from {{Improvement}} to {{Bug}}. Any savepoint that takes
longer than the default ask timeout is reported as {{COMPLETED}} with a {{failure-cause}},
even though the savepoint itself completes successfully:
{code}
{
  "status": {
    "id": "COMPLETED"
  },
  "operation": {
    "failure-cause": {
      "class": "java.util.concurrent.CompletionException",
      "stack-trace": "java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException:
Ask timed out on [Actor[akka://flink/user/jobmanager_0#42163687]] after [30000 ms]. Sender[null]
sent message of type \"org.apache.flink.runtime.rpc.messages.LocalFencedMessage\".\n\tat java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326)\n\tat
java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338)\n\tat java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:911)\n\tat
java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:899)\n\tat
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)\n\tat java.util.concurrent.CompletableFuture.completeExceptionally...<omitted
for brevity>"
    }
  }
}
{code}
This means that we can't use the REST API to reliably trigger savepoints. I've verified this
with a small program that blocks during checkpoints for a configurable amount of time. 
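
To make the pitfall concrete, here is a minimal sketch of how a client might interpret a response shaped like the one above. The class {{SavepointStatusCheck}}, its {{Outcome}} enum, and the regex-based parsing are hypothetical and not part of Flink; the sketch also assumes that any status id other than {{COMPLETED}} means the operation is still running. The point is that {{COMPLETED}} only says the asynchronous operation has finished, so a client has to look for {{failure-cause}} as well. With this bug, a response classified as failed can still belong to a savepoint that was written successfully.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Flink: classifies a savepoint status response body. */
public class SavepointStatusCheck {

    public enum Outcome { IN_PROGRESS, SUCCEEDED, REPORTED_FAILED }

    // Matches the "id" field nested directly under "status", as in the response above.
    private static final Pattern STATUS_ID =
            Pattern.compile("\"status\"\\s*:\\s*\\{\\s*\"id\"\\s*:\\s*\"([A-Z_]+)\"");

    public static Outcome classify(String responseBody) {
        Matcher m = STATUS_ID.matcher(responseBody);
        if (!m.find()) {
            throw new IllegalArgumentException("no status id in response");
        }
        // Assumption: anything other than COMPLETED means the operation is still running.
        if (!"COMPLETED".equals(m.group(1))) {
            return Outcome.IN_PROGRESS;
        }
        // COMPLETED only means the operation finished; a "failure-cause"
        // field means it finished with an error reported to the client.
        return responseBody.contains("\"failure-cause\"")
                ? Outcome.REPORTED_FAILED
                : Outcome.SUCCEEDED;
    }

    public static void main(String[] args) {
        // Shortened version of the response above; the stack trace is left out.
        String body = "{\"status\":{\"id\":\"COMPLETED\"},"
                + "\"operation\":{\"failure-cause\":"
                + "{\"class\":\"java.util.concurrent.CompletionException\"}}}";
        System.out.println(classify(body)); // prints REPORTED_FAILED
    }
}
```

A real client would use a JSON parser instead of a regex; the string matching here only keeps the sketch free of dependencies.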

The only workaround as far as I know is to increase {{akka.ask.timeout}} (although I would
not recommend this as it affects other things as well). Note that increasing {{web.timeout}}
does not affect this.

> Default RPC timeout is used when triggering savepoint via JobMasterGateway
> --------------------------------------------------------------------------
>
>                 Key: FLINK-10193
>                 URL: https://issues.apache.org/jira/browse/FLINK-10193
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination
>    Affects Versions: 1.5.3, 1.6.0
>            Reporter: Gary Yao
>            Assignee: Gary Yao
>            Priority: Critical
>
> When calling {{JobMasterGateway#triggerSavepoint(String, boolean, Time)}}, the default
> timeout is used because the time parameter of the method is not annotated with {{@RpcTimeout}}.

> *Expected behavior*
> * timeout for the RPC should be {{RpcUtils.INF_TIMEOUT}}
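
The mechanism described in the issue can be illustrated with a small, self-contained sketch. It is not Flink's implementation: {{TimeoutParam}} is a locally defined stand-in for an annotation like {{@RpcTimeout}}, and {{Gateway}}, {{extractTimeout}}, and the 30-second default (chosen to match the timeout in the stack trace) are hypothetical. It shows how a reflection-based lookup falls back to a default timeout when no parameter carries the annotation, even though the caller passes a longer timeout.

```java
import java.lang.annotation.Annotation;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.time.Duration;

public class RpcTimeoutSketch {

    /** Stand-in for an annotation like @RpcTimeout; defined here for illustration only. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.PARAMETER)
    public @interface TimeoutParam {}

    /** Illustrative default, matching the 30000 ms in the stack trace. */
    public static final Duration DEFAULT_TIMEOUT = Duration.ofSeconds(30);

    /** Hypothetical gateway: only the second method marks its timeout parameter. */
    public interface Gateway {
        String triggerUnannotated(String targetDir, Duration timeout);
        String triggerAnnotated(String targetDir, @TimeoutParam Duration timeout);
    }

    /** Returns the annotated Duration argument, or the default if no parameter is annotated. */
    public static Duration extractTimeout(Method method, Object[] args) {
        Annotation[][] annotations = method.getParameterAnnotations();
        for (int i = 0; i < annotations.length; i++) {
            for (Annotation a : annotations[i]) {
                if (a instanceof TimeoutParam && args[i] instanceof Duration) {
                    return (Duration) args[i];
                }
            }
        }
        // No annotated parameter found: the caller's timeout is silently ignored.
        return DEFAULT_TIMEOUT;
    }

    public static void main(String[] args) throws Exception {
        Object[] callArgs = {"/tmp/savepoints", Duration.ofMinutes(10)};
        Method plain = Gateway.class.getMethod("triggerUnannotated", String.class, Duration.class);
        Method marked = Gateway.class.getMethod("triggerAnnotated", String.class, Duration.class);
        System.out.println(extractTimeout(plain, callArgs));  // prints PT30S
        System.out.println(extractTimeout(marked, callArgs)); // prints PT10M
    }
}
```

In this sketch the caller's 10-minute timeout only takes effect once the parameter is annotated, which mirrors the behavior the issue describes.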



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
