flink-issues mailing list archives

From "Lu Niu (Jira)" <j...@apache.org>
Subject [jira] [Updated] (FLINK-16931) Large _metadata file leads to JobManager not responding on restart
Date Thu, 02 Apr 2020 00:12:00 GMT

     [ https://issues.apache.org/jira/browse/FLINK-16931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lu Niu updated FLINK-16931:
---------------------------
    Description: 
When the _metadata file is big, the JobManager can never recover from the checkpoint. It falls into a loop of fetch checkpoint -> JM timeout -> restart. Here is the related log: 
{code:java}
2020-04-01 17:08:25,689 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Recovering checkpoints from ZooKeeper.
2020-04-01 17:08:25,698 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Found 3 checkpoints in ZooKeeper.
2020-04-01 17:08:25,698 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying to fetch 3 checkpoints from storage.
2020-04-01 17:08:25,698 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying to retrieve checkpoint 50.
2020-04-01 17:08:48,589 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying to retrieve checkpoint 51.
2020-04-01 17:09:12,775 INFO org.apache.flink.yarn.YarnResourceManager - The heartbeat of JobManager with id 02500708baf0bb976891c391afd3d7d5 timed out.
{code}
Digging into the code, it looks like ExecutionGraph::restart runs in the JobMaster main thread and eventually calls ZooKeeperCompletedCheckpointStore::retrieveCompletedCheckpoint, which downloads the _metadata file from DFS. The main thread is therefore blocked for the duration of the download. One possible solution is to make the download asynchronous. More aspects may need to be considered, since the original change deliberately made this single-threaded: [https://github.com/apache/flink/pull/7568]
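Below is a minimal sketch of the idea in plain Java (not the actual Flink code; downloadMetadata, ioExecutor and mainThreadExecutor are hypothetical placeholders): the blocking DFS read runs on a dedicated I/O executor, and the restart logic resumes on the single main-thread executor once the bytes are available, so the main thread stays free to answer heartbeats in the meantime.
{code:java}
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch only -- not the actual Flink code. It shows the general pattern:
// run the blocking _metadata download on a dedicated I/O executor so the
// single "main" thread stays free to answer heartbeats, then hand the
// result back to the main-thread executor. downloadMetadata, ioExecutor
// and mainThreadExecutor are hypothetical placeholders.
public class AsyncCheckpointFetchSketch {

    public static void main(String[] args) {
        ExecutorService ioExecutor = Executors.newFixedThreadPool(4);
        ExecutorService mainThreadExecutor = Executors.newSingleThreadExecutor();

        CompletableFuture
            // the blocking DFS read happens off the main thread
            .supplyAsync(() -> downloadMetadata("checkpoint-51/_metadata"), ioExecutor)
            // resume the restart logic back on the single main thread
            .thenAcceptAsync(
                bytes -> System.out.println("restored " + bytes.length + " bytes"),
                mainThreadExecutor)
            .join();

        ioExecutor.shutdown();
        mainThreadExecutor.shutdown();
    }

    // Stand-in for the slow download of a large _metadata file from DFS.
    private static byte[] downloadMetadata(String path) {
        try {
            Thread.sleep(2_000); // simulate a multi-second remote read
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new byte[64 * 1024 * 1024];
    }
}
{code}
In the real fix the callback would have to run on the JobMaster's main-thread executor, and access to the checkpoint store would still need to stay effectively single-threaded, which is why the change in the PR above is relevant.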


> Large _metadata file leads to JobManager not responding on restart
> ------------------------------------------------------------------
>
>                 Key: FLINK-16931
>                 URL: https://issues.apache.org/jira/browse/FLINK-16931
>             Project: Flink
>          Issue Type: Bug
>            Reporter: Lu Niu
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
