hadoop-yarn-issues mailing list archives

From "Tsuyoshi Ozawa (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
Date Fri, 27 Feb 2015 15:20:04 GMT

    [ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14340265#comment-14340265 ]

Tsuyoshi Ozawa commented on YARN-2820:
--------------------------------------

[~zxu] Thank you for updating the patch. I have reconsidered closeInternal(). If we call fs.close()
twice or more, it can unexpectedly close another file descriptor, which can lead to unexpected behaviour.
To avoid this, we should remove closeWithRetries and call fs.close() directly in closeInternal().
What do you think? Thank you for bearing with the iterative reviews.
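
To make the suggestion concrete, here is a minimal sketch of a closeInternal() that calls close() exactly once and lets any IOException propagate instead of retrying. It is not taken from the patch: CloseOnceSketch is a hypothetical class, and java.io.Closeable stands in for the Hadoop FileSystem handle so that the example is self-contained.

```java
import java.io.Closeable;
import java.io.IOException;

/** Hypothetical stand-in for the state store; not code from the YARN-2820 patch. */
final class CloseOnceSketch {
    // Assumption: Closeable stands in for the store's FileSystem handle.
    private final Closeable fs;

    CloseOnceSketch(Closeable fs) {
        this.fs = fs;
    }

    /**
     * Closes the underlying handle with a single call and no retry loop.
     * A failure is reported to the caller rather than triggering a second close().
     */
    void closeInternal() throws IOException {
        fs.close();
    }
}
```

With this shape, a failing close() is attempted once and surfaces as an IOException, so a second close() is never issued against the same handle.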

> Do retry in FileSystemRMStateStore for better error recovery when update/store failure
due to IOException.
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-2820
>                 URL: https://issues.apache.org/jira/browse/YARN-2820
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>    Affects Versions: 2.5.0, 2.6.0
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>         Attachments: YARN-2820.000.patch, YARN-2820.001.patch, YARN-2820.002.patch, YARN-2820.003.patch,
YARN-2820.004.patch, YARN-2820.005.patch, YARN-2820.006.patch, YARN-2820.007.patch, YARN-2820.007.patch
>
>
> Do retry in FileSystemRMStateStore for better error recovery when update/store failure
due to IOException.
> When we use FileSystemRMStateStore as yarn.resourcemanager.store.class, we saw the following
IOException cause the RM to shut down.
> {code}
> 2014-10-29 23:49:12,202 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
> Updating info for attempt: appattempt_1409135750325_109118_000001 at: 
> /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
> appattempt_1409135750325_109118_000001
> 2014-10-29 23:49:19,495 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete
> /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
> appattempt_1409135750325_109118_000001.new.tmp retrying...
> 2014-10-29 23:49:23,757 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete
> /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
> appattempt_1409135750325_109118_000001.new.tmp retrying...
> 2014-10-29 23:49:31,120 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete
> /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
> appattempt_1409135750325_109118_000001.new.tmp retrying...
> 2014-10-29 23:49:46,283 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
> Error updating info for attempt: appattempt_1409135750325_109118_000001
> java.io.IOException: Unable to close file because the last block does not have enough
number of replicas.
> 2014-10-29 23:49:46,284 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore:
> Error storing/updating appAttempt: appattempt_1409135750325_109118_000001
> 2014-10-29 23:49:46,916 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager:
> Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED.
Cause: 
> java.io.IOException: Unable to close file because the last block does not have enough
number of replicas. 
> at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2132)
> at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2100)
> at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)
> at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103)
> at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.writeFile(FileSystemRMStateStore.java:522)
> at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateFile(FileSystemRMStateStore.java:534)
> at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateApplicationAttemptStateInternal(FileSystemRMStateStore.java:389)
> at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675)
> at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
> at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
> at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
> at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
> at java.lang.Thread.run(Thread.java:744)
> {code}
> As discussed in YARN-1778, the TestFSRMStateStore failure is also due to an IOException in
storeApplicationStateInternal.
> Stack trace from TestFSRMStateStore failure:
> {code}
>  2015-02-03 00:09:19,092 INFO  [Thread-110] recovery.TestFSRMStateStore (TestFSRMStateStore.java:run(285))
- testFSRMStateStoreClientRetry: Exception
>  org.apache.hadoop.ipc.RemoteException(java.io.IOException): NameNode still not started
>        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkNNStartup(NameNodeRpcServer.java:1876)
>        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:971)
>        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:622)
>       at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636)
>        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:973)
>        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2134)
>        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2130)
>        at java.security.AccessController.doPrivileged(Native Method)
>        at javax.security.auth.Subject.doAs(Subject.java:415)
>        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669)
>        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2128)
>        at org.apache.hadoop.ipc.Client.call(Client.java:1474)
>        at org.apache.hadoop.ipc.Client.call(Client.java:1405)
>        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
>        at com.sun.proxy.$Proxy23.mkdirs(Unknown Source)
>        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.mkdirs(ClientNamenodeProtocolTranslatorPB.java:557)
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>        at java.lang.reflect.Method.invoke(Method.java:606)
>        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
>        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101)
>        at com.sun.proxy.$Proxy24.mkdirs(Unknown Source)
>        at org.apache.hadoop.hdfs.DFSClient.primitiveMkdir(DFSClient.java:2991)
>        at org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:2961)
>        at org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:973)
>        at org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:969)
>        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>        at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirsInternal(DistributedFileSystem.java:969)
>        at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:962)
>        at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1869)
>        at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.storeApplicationStateInternal(FileSystemRMStateStore.java:364)
>        at org.apache.hadoop.yarn.server.resourcemanager.recovery.TestFSRMStateStore$2.run(TestFSRMStateStore.java:273)
>  {code}
>  It would be better to improve FileSystemRMStateStore to retry when an update/store operation fails,
for better error recovery.
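
The retry behaviour requested in the description can be sketched as a small bounded-retry wrapper around an operation that throws IOException. This is only an illustration under stated assumptions, not the patch itself: RetrySketch, IoAction, runWithRetries, maxRetries and retryIntervalMs are hypothetical names, and the wrapped action stands in for a store/update call such as the writeFile or mkdirs calls seen in the stack traces.

```java
import java.io.IOException;

/** Hypothetical helper illustrating bounded retries; not code from the YARN-2820 patch. */
final class RetrySketch {
    /** An operation that may fail with IOException, e.g. a file write or mkdirs call. */
    interface IoAction<T> {
        T run() throws IOException;
    }

    /**
     * Runs the action, retrying up to maxRetries times after an IOException
     * and sleeping retryIntervalMs between attempts. If every attempt fails,
     * the last IOException is rethrown to the caller.
     */
    static <T> T runWithRetries(IoAction<T> action, int maxRetries, long retryIntervalMs)
            throws IOException, InterruptedException {
        int retries = 0;
        while (true) {
            try {
                return action.run();
            } catch (IOException e) {
                if (retries >= maxRetries) {
                    throw e; // retries exhausted: surface the last failure
                }
                retries++;
                Thread.sleep(retryIntervalMs);
            }
        }
    }
}
```

A transient failure, such as the "not enough replicas" or "NameNode still not started" errors above, would then be retried a bounded number of times before the failure is reported, rather than immediately raising a fatal state-store event.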



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
