hadoop-yarn-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Chun Chen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-2992) ZKRMStateStore crashes due to session expiry
Date Tue, 27 Jan 2015 04:03:35 GMT

    [ https://issues.apache.org/jira/browse/YARN-2992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292937#comment-14292937
] 

Chun Chen commented on YARN-2992:
---------------------------------

[~kasha] [~rohithsharma] [~jianhe], we are constantly facing the following error
RM log
{code}
2015-01-27 00:13:19,379 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to
server 10.196.128.13/10.196.128.13:2181. Will not attempt to authenticate using SASL (unknown
erro
r)
2015-01-27 00:13:19,383 INFO org.apache.zookeeper.ClientCnxn: Socket connection established
to 10.196.128.13/10.196.128.13:2181, initiating session
2015-01-27 00:13:19,404 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete
on server 10.196.128.13/10.196.128.13:2181, sessionid = 0x24ab193421e4812, negotiated timeout
= 
10000
2015-01-27 00:13:19,417 WARN org.apache.zookeeper.ClientCnxn: Session 0x24ab193421e4812 for
server 10.196.128.13/10.196.128.13:2181, unexpected error, closing socket connection and attempti
ng reconnect
java.io.IOException: Broken pipe
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
        at sun.nio.ch.IOUtil.write(IOUtil.java:65)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:470)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:117)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:355)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)
2015-01-27 00:13:19,517 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore:
Exception while executing a ZK operation.
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
        at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:931)
        at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:895)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:892)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1031)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1050)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:898)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.access$600(ZKRMStateStore.java:82)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$VerifyActiveStatusThread.run(ZKRMStateStore.java:1003)
2015-01-27 00:13:19,518 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore:
Retrying operation on ZK. Retry no. 934
{code}

ZK log
{code}
2015-01-27 00:13:19,300 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197]
- Accepted socket connection from /10.240.92.100:46464
2015-01-27 00:13:19,302 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@861]
- Client attempting to renew session 0x24ab193421e4812 at /10.240.92.100:46464
2015-01-27 00:13:19,302 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:Learner@108]
- Revalidating client: 0x24ab193421e4812
2015-01-27 00:13:19,303 [myid:1] - INFO  [QuorumPeer[myid=1]/0.0.0.0:2181:ZooKeeperServer@617]
- Established session 0x24ab193421e4812 with negotiated timeout 10000 for client /10.240.92.100:46464
2015-01-27 00:13:19,303 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@892]
- got auth packet /10.240.92.100:46464
2015-01-27 00:13:19,303 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@926]
- auth success /10.240.92.100:46464
2015-01-27 00:13:19,320 [myid:1] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362]
- Exception causing close of session 0x24ab193421e4812 due to java.io.IOException: Len error
1425415
2015-01-27 00:13:19,321 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007]
- Closed socket connection for client /10.240.92.100:46464 which had sessionid 0x24ab193421e4812
2015-01-27 00:13:23,093 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197]
- Accepted socket connection from /10.240.92.100:46477
2015-01-27 00:13:23,159 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@861]
- Client attempting to renew session 0x24ab193421e4812 at /10.240.92.100:46477
2015-01-27 00:13:23,159 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:Learner@108]
- Revalidating client: 0x24ab193421e4812
2015-01-27 00:13:23,160 [myid:1] - INFO  [QuorumPeer[myid=1]/0.0.0.0:2181:ZooKeeperServer@617]
- Established session 0x24ab193421e4812 with negotiated timeout 10000 for client /10.240.92.100:46477
2015-01-27 00:13:23,160 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@892]
- got auth packet /10.240.92.100:46477
2015-01-27 00:13:23,160 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@926]
- auth success /10.240.92.100:46477
2015-01-27 00:13:23,170 [myid:1] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362]
- Exception causing close of session 0x24ab193421e4812 due to java.io.IOException: Len error
1425415
2015-01-27 00:13:23,171 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007]
- Closed socket connection for client /10.240.92.100:46477 which had sessionid 0x24ab193421e4812
{code}

It seems when zk receives a large packet ( Len error 1425415 > 1M) and then initiatively
closed the socket connection due to the large packet error. Haven't tried the patch here which
renews a zk connection each time it fails to execute a zk operation, but I can't figure out
why this happens, since VerifyActiveStatusThread only creates and deletes the fencing node.
The packet shouldn't be larger than 1M. Do you have any clues why this happens? Thanks.

> ZKRMStateStore crashes due to session expiry
> --------------------------------------------
>
>                 Key: YARN-2992
>                 URL: https://issues.apache.org/jira/browse/YARN-2992
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Karthik Kambatla
>            Assignee: Karthik Kambatla
>            Priority: Blocker
>             Fix For: 2.7.0
>
>         Attachments: yarn-2992-1.patch
>
>
> We recently saw the RM crash with the following stacktrace. On session expiry, we should
gracefully transition to standby. 
> {noformat}
> 2014-12-18 06:28:42,689 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager:
Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED.
Cause: 
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session
expired 
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:127) 
> at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:931) 
> at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911) 
> at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:930)

> at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:927)

> at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1069)

> at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1088)

> at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:927)

> at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:941)

> at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.setDataWithRetries(ZKRMStateStore.java:958)

> at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:687)

> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message