flink-issues mailing list archives

From "Aljoscha Krettek (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (FLINK-3431) Add retrying logic for RocksDB snapshots
Date Mon, 03 Apr 2017 12:12:41 GMT

     [ https://issues.apache.org/jira/browse/FLINK-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aljoscha Krettek updated FLINK-3431:
    Component/s:     (was: Streaming)
                 State Backends, Checkpointing
                 DataStream API

> Add retrying logic for RocksDB snapshots
> ----------------------------------------
>                 Key: FLINK-3431
>                 URL: https://issues.apache.org/jira/browse/FLINK-3431
>             Project: Flink
>          Issue Type: Improvement
>          Components: DataStream API, State Backends, Checkpointing
>            Reporter: Gyula Fora
>            Priority: Critical
> Currently, RocksDB snapshots rely on the HDFS copy not failing while the snapshot is being taken.
> In some cases, when the state is large enough, the HDFS nodes can become so overloaded
> that the copy operation fails with errors like this:
> AsynchronousException{java.io.IOException: All datanodes are bad.
> at org.apache.flink.streaming.runtime.tasks.StreamTask$1.run(StreamTask.java:545)
> Caused by: java.io.IOException: All datanodes are bad. Aborting...
> at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1023)
> at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:838)
> at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:483)
> I think it is important that we don't immediately fail the job in these cases but
> instead retry the copy operation after some random sleep time. It might also be good to do
> a random sleep before the copy, depending on the state size, to smooth out IO a little bit.
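The retry-with-random-sleep idea above can be sketched as a small generic helper. This is a minimal illustration, not Flink's actual implementation: the helper name, parameters, and the simulated copy action are all hypothetical, and a real fix would live inside the RocksDB state backend's snapshot path.

```java
import java.util.Random;
import java.util.concurrent.Callable;

public class RetryingCopy {

    // Hypothetical helper: retries an action with randomized exponential
    // backoff. The jitter spreads concurrent retries out in time so that
    // overloaded HDFS datanodes are not hit by all subtasks at once.
    static <T> T retryWithBackoff(Callable<T> action, int maxAttempts, long baseSleepMillis)
            throws Exception {
        Random random = new Random();
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                last = e;
                if (attempt == maxAttempts - 1) {
                    break;
                }
                // Exponential backoff (base << attempt) plus random jitter.
                long sleep = (baseSleepMillis << attempt) + random.nextInt((int) baseSleepMillis);
                Thread.sleep(sleep);
            }
        }
        throw last; // all attempts failed; surface the last error
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        // Simulated copy operation that fails twice before succeeding,
        // mimicking the transient "All datanodes are bad" condition.
        String result = retryWithBackoff(() -> {
            if (++calls[0] < 3) {
                throw new java.io.IOException("All datanodes are bad. Aborting...");
            }
            return "copied";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

The per-attempt sleep could additionally be scaled by the state size, as suggested above, so that large snapshots wait longer before retrying.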

This message was sent by Atlassian JIRA
