hadoop-hdfs-issues mailing list archives

From "Rafal Wojdyla (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HDFS-6179) Synchronized
Date Tue, 01 Apr 2014 15:07:15 GMT
Rafal Wojdyla created HDFS-6179:
-----------------------------------

             Summary: Synchronized 
                 Key: HDFS-6179
                 URL: https://issues.apache.org/jira/browse/HDFS-6179
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: datanode, namenode
    Affects Versions: 2.2.0
            Reporter: Rafal Wojdyla


Scenario:
* 600 active DNs
* 1 *active* NN
* HA configuration

When we start the SbNN, the huge number of blocks combined with a relatively small initialDelay
causes it to go through multiple stop-the-world garbage collections during startup (lasting
minutes; the Namenode heap size is 75GB). We've observed that the SbNN's slowness affects the
active NN: the active NN loses DNs (DNs are considered dead due to lack of heartbeats). We
assume that some DNs are hanging.

When a DN is considered dead by the active Namenode, we've observed a "deadlock" in the DN
process. Part of the stack trace:

{noformat}
"DataNode: [file:/disk1,file:/disk2]  heartbeating to standbynamenode.net/10.10.10.10:8020"
daemon prio=10 tid=0x00007ff429417800 nid=0x7f2a in Object.wait() [0x00007ff42122c000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:485)
        at org.apache.hadoop.ipc.Client.call(Client.java:1333)
        - locked <0x00000007db94e4c8> (a org.apache.hadoop.ipc.Client$Call)
        at org.apache.hadoop.ipc.Client.call(Client.java:1300)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
        at $Proxy9.registerDatanode(Unknown Source)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
        at $Proxy9.registerDatanode(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:146)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:623)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:740)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromStandby(BPOfferService.java:603)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:506)
        - locked <0x0000000780006e08> (a org.apache.hadoop.hdfs.server.datanode.BPOfferService)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:704)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:539)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:676)
        at java.lang.Thread.run(Thread.java:662)

"DataNode: [file:/disk1,file:/disk2]  heartbeating to activenamenode.net/10.10.10.11:8020"
daemon prio=10 tid=0x00007ff428a24000 nid=0x7f29 waiting for monitor entry [0x00007ff42132e000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.updateActorStatesFromHeartbeat(BPOfferService.java:413)
        - waiting to lock <0x0000000780006e08> (a org.apache.hadoop.hdfs.server.datanode.BPOfferService)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:535)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:676)
        at java.lang.Thread.run(Thread.java:662)
{noformat}

Notice that both threads use the same lock (<0x0000000780006e08>), due to synchronization on
BPOfferService. The problem is that the command from the standby can't be processed because
the standby Namenode is unresponsive; nevertheless the DN thread keeps holding the lock while
it waits for a reply from the SbNN, and it waits long enough for the DN to be considered dead
by the active Namenode.
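
The contention described above can be reproduced in isolation. The following is only a minimal sketch, not the actual HDFS code: `SharedService` is a hypothetical stand-in for BPOfferService, the two synchronized methods borrow their names from the stack trace purely for readability, and the unresponsive RPC to the SbNN is simulated with `Thread.sleep`. It shows that a thread entering a synchronized method on the same object is blocked for as long as another thread holds the monitor in a slow call.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class MonitorContentionSketch {

    /** Hypothetical stand-in for BPOfferService: one monitor shared by both actor threads. */
    static class SharedService {
        private final CountDownLatch insideSlowCall = new CountDownLatch(1);

        // Stands in for the standby path (command processing -> re-register -> blocking RPC).
        synchronized void processCommandFromStandby(long rpcMillis) throws InterruptedException {
            insideSlowCall.countDown();  // signal: the monitor is now held
            Thread.sleep(rpcMillis);     // simulated unresponsive RPC; sleep does not release the monitor
        }

        // Stands in for the heartbeat path of the thread talking to the active NN.
        synchronized void updateActorStatesFromHeartbeat() {
            // Nothing to do here; entering the method already requires the monitor.
        }
    }

    /** Returns how long (in ms) the heartbeat path waited for the monitor. */
    static long measureHeartbeatDelayMillis(long rpcMillis) throws InterruptedException {
        SharedService service = new SharedService();
        Thread standbyActor = new Thread(() -> {
            try {
                service.processCommandFromStandby(rpcMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "standby-actor");
        standbyActor.start();

        service.insideSlowCall.await();            // wait until the standby thread holds the monitor
        long start = System.nanoTime();
        service.updateActorStatesFromHeartbeat();  // BLOCKED until the slow call returns
        long waited = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);

        standbyActor.join();
        return waited;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("heartbeat path waited ~" + measureHeartbeatDelayMillis(500) + " ms");
    }
}
```

In this sketch the wait on the heartbeat path grows with the simulated RPC time. That matches the observation in the trace: the thread heartbeating to the active NN cannot make progress until the call to the SbNN returns.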

Info: if we kill the SbNN, the DN instantly reconnects to the active NN.



--
This message was sent by Atlassian JIRA
(v6.2#6252)