hadoop-hdfs-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HDFS-1056) Multi-node RPC deadlocks during block recovery
Date Sun, 21 Mar 2010 23:21:27 GMT

    [ https://issues.apache.org/jira/browse/HDFS-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12848002#action_12848002 ]

Todd Lipcon commented on HDFS-1056:
-----------------------------------

I think I understand what's happening here. I am restarting an HDFS cluster underneath an
HBase cluster, and the following events transpire:
# DN with xceiver port X1 and ipcPort 11071 goes down
# DN starts back up with different xceiver port X2 but same ipcPort 11071
# Client calls recoverBlock, and since the ipcPort is the same, hits the new DN
# The DN (now known as 1.2.3.4:X2) sees that the target is 1.2.3.4:X1, which it decides is
not local. It then connects to itself via RPC rather than using the "direct invocation" shortcut
added by HADOOP-3673

To verify this, I added a log message when creating the InterDataNodeProtocol Proxy:
2010-03-21 16:02:33,378 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Creating IDNPP for non-local id 192.168.42.40:50397 (dnReg=DatanodeRegistration(192.168.42.40:39786, storageID=DS-126683980-192.168.42.40-41424-1269146536997, infoPort=40813, ipcPort=11071))

(dnReg is the new local DN)

I think the solution may be to determine the "equality" of the DNs based on IP and ipcPort,
not by name (which is IP plus the xceiver port). There may be issues with this, though; I'll
have to think it through more thoroughly.
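
To make the difference between the two comparisons concrete, here is a minimal sketch using the addresses from the log line above. DnAddr and its methods are hypothetical stand-ins for illustration only, not the real DatanodeID / DatanodeRegistration classes, and the sketch assumes the stale target still carries ipcPort 11071 (as in the restart scenario described).

```java
/** Hypothetical stand-in for a datanode's identity; not the real HDFS class. */
final class DnAddr {
    final String ip;
    final int xceiverPort; // data-transfer port; changes across the restart
    final int ipcPort;     // inter-datanode RPC port; stays fixed across the restart

    DnAddr(String ip, int xceiverPort, int ipcPort) {
        this.ip = ip;
        this.xceiverPort = xceiverPort;
        this.ipcPort = ipcPort;
    }

    /** The "name" is ip:xceiverPort, which is what the locality check compares today. */
    String name() {
        return ip + ":" + xceiverPort;
    }

    /** Name-based comparison: fails when the xceiver port changed after a restart. */
    static boolean sameByName(DnAddr a, DnAddr b) {
        return a.name().equals(b.name());
    }

    /** Proposed comparison (an assumption, not a committed fix): IP plus ipcPort. */
    static boolean sameByIpc(DnAddr a, DnAddr b) {
        return a.ip.equals(b.ip) && a.ipcPort == b.ipcPort;
    }
}

public class DnEqualityDemo {
    public static void main(String[] args) {
        // New local DN, from dnReg in the log line.
        DnAddr local = new DnAddr("192.168.42.40", 39786, 11071);
        // Stale target the client still holds (old xceiver port, same ipcPort).
        DnAddr target = new DnAddr("192.168.42.40", 50397, 11071);

        System.out.println("by name: " + DnAddr.sameByName(local, target));
        System.out.println("by ip+ipcPort: " + DnAddr.sameByIpc(local, target));
    }
}
```

With the name-based check the target looks non-local, so the DN would open an RPC connection to itself; with the IP and ipcPort check it would be treated as local and could take the direct-invocation path.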

> Multi-node RPC deadlocks during block recovery
> ----------------------------------------------
>
>                 Key: HDFS-1056
>                 URL: https://issues.apache.org/jira/browse/HDFS-1056
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: data-node
>    Affects Versions: 0.20.2, 0.21.0, 0.22.0
>            Reporter: Todd Lipcon
>
> Believe it or not, I'm seeing HADOOP-3657 / HADOOP-3673 in a 5-node 0.20 cluster. I have
> many concurrent writes on the cluster, and when I kill a DN, some percentage of the time I
> get one of these cross-node deadlocks among 3 of the nodes (replication 3). All of the DN
> RPC server threads are tied up waiting on RPC clients to other datanodes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

