hadoop-yarn-issues mailing list archives

From "Ming Ma (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-2862) RM might not start if the machine was hard shutdown and FileSystemRMStateStore was used
Date Mon, 17 Nov 2014 17:45:34 GMT

    [ https://issues.apache.org/jira/browse/YARN-2862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214895#comment-14214895 ]

Ming Ma commented on YARN-2862:
-------------------------------

Thanks, [~jira.shegalov], [~jianhe], [~zjshen].

I am able to reproduce the issue in trunk: a) pick an application in FileSystemRMStateStore; b)
truncate its state file to zero size by running "cat /dev/null > application_xxxx_yyyy"; c) restart the RM.

The corrupted .new file might be a separate issue. There is no .new file in this specific case,
where the state file has already been written or updated from the RM's point of view. However, it
appears the state file had not been flushed from the OS to disk before the machine was hard shut down.
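
To illustrate the two points above, here is a minimal sketch in plain JDK code. It is not the
actual FileSystemRMStateStore implementation, and the class and method names (StateFileSketch,
writeDurably, isUnusable) are hypothetical. It shows one way a writer could force data to disk
before returning, and one way a reader could detect a zero-length state file like the one
produced by the repro step. Whether the RM should skip such a file or fail on it is a separate
decision that the sketch does not make.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration only; not Hadoop code.
public class StateFileSketch {

    /** Writes data and asks the OS to push it to the storage device before returning. */
    public static void writeDurably(Path file, byte[] data) throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ByteBuffer buf = ByteBuffer.wrap(data);
            while (buf.hasRemaining()) {
                ch.write(buf);
            }
            // Without this call the bytes may sit in the OS page cache,
            // which is lost if the machine is hard shut down.
            ch.force(true);
        }
    }

    /** Returns true if the file is missing or zero-length, i.e. holds no state to load. */
    public static boolean isUnusable(Path file) throws IOException {
        return !Files.exists(file) || Files.size(file) == 0;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("rmstore-sketch");
        Path app = dir.resolve("application_xxxx_yyyy");

        writeDurably(app, new byte[] {1, 2, 3});
        System.out.println("after write, unusable = " + isUnusable(app));

        // Same effect as: cat /dev/null > application_xxxx_yyyy
        Files.write(app, new byte[0]);
        System.out.println("after truncation, unusable = " + isUnusable(app));
    }
}
```

A check like isUnusable would let recovery code decide up front what to do with an empty file,
instead of running into an exception later in the load path.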

> RM might not start if the machine was hard shutdown and FileSystemRMStateStore was used
> ---------------------------------------------------------------------------------------
>
>                 Key: YARN-2862
>                 URL: https://issues.apache.org/jira/browse/YARN-2862
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Ming Ma
>
> This might be a known issue. Given that FileSystemRMStateStore isn't used for the HA scenario,
> it might not be that important, unless there is something we need to fix at the RM layer to make
> it more tolerant of RMStore issues.
> When the RM was hard shut down, the OS might not have had a chance to persist blocks. Some of the
> stored application data ended up with size zero after reboot, and the RM did not handle that.
> {noformat}
> ls -al /var/log/hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1412702189634_324351
> total 156
> drwxr-xr-x.    2 x y   4096 Nov 13 16:45 .
> drwxr-xr-x. 1524 x y 151552 Nov 13 16:45 ..
> -rw-r--r--.    1 x y      0 Nov 13 16:45 appattempt_1412702189634_324351_000001
> -rw-r--r--.    1 x y      0 Nov 13 16:45 .appattempt_1412702189634_324351_000001.crc
> -rw-r--r--.    1 x y      0 Nov 13 16:45 application_1412702189634_324351
> -rw-r--r--.    1 x y      0 Nov 13 16:45 .application_1412702189634_324351.crc
> {noformat}
> When the RM starts up:
> {noformat}
> 2014-11-13 16:55:25,844 WARN org.apache.hadoop.fs.FSInputChecker: Problem opening checksum file: file:/var/log/hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1412702189634_324351/application_1412702189634_324351. Ignoring exception:
> java.io.EOFException
>         at java.io.DataInputStream.readFully(DataInputStream.java:197)
>         at java.io.DataInputStream.readFully(DataInputStream.java:169)
>         at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:146)
>         at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
>         at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:792)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.readFile(FileSystemRMStateStore.java:501)
> ...
> 2014-11-13 17:40:48,876 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Failed to load/recover state
> java.lang.NullPointerException
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ApplicationState.getAppId(RMStateStore.java:184)
>         at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:306)
>         at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:425)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1027)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:484)
>         at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:834)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
