hadoop-common-dev mailing list archives

From "Christian Kunz (JIRA)" <j...@apache.org>
Subject [jira] Created: (HADOOP-1998) No recovery when trying to replicate on marginal datanode
Date Fri, 05 Oct 2007 18:45:51 GMT
No recovery when trying to replicate on marginal datanode
---------------------------------------------------------

                 Key: HADOOP-1998
                 URL: https://issues.apache.org/jira/browse/HADOOP-1998
             Project: Hadoop
          Issue Type: Bug
          Components: dfs
    Affects Versions: 0.15.0
         Environment: Sep 14 nightly build with a couple of mapred-related patches
            Reporter: Christian Kunz


We have been uploading a lot of data to hdfs, running about 400 scripts in parallel, each calling
Hadoop's command-line utility in a distributed fashion. Many of them started to hang when copying
large files (>120GB), repeating the following messages without end:

07/10/05 15:44:25 INFO fs.DFSClient: Could not complete file, retrying...
07/10/05 15:44:26 INFO fs.DFSClient: Could not complete file, retrying...
07/10/05 15:44:26 INFO fs.DFSClient: Could not complete file, retrying...
07/10/05 15:44:27 INFO fs.DFSClient: Could not complete file, retrying...
07/10/05 15:44:27 INFO fs.DFSClient: Could not complete file, retrying...
07/10/05 15:44:28 INFO fs.DFSClient: Could not complete file, retrying...
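The endless "Could not complete file, retrying..." loop above suggests the client retries without any upper bound. A minimal sketch of bounded retry logic (the names retryComplete, Attempt, and the limits are illustrative, not actual DFSClient code):

```java
// Hedged sketch: cap completion retries instead of looping forever, so a
// marginal datanode surfaces as an error rather than a hung client.
public class BoundedRetry {
    interface Attempt {
        boolean tryOnce(); // one "complete file" attempt; true on success
    }

    // Returns true if the attempt succeeded within maxRetries tries,
    // false once the retry budget is exhausted.
    static boolean retryComplete(Attempt a, int maxRetries, long sleepMs) {
        for (int i = 0; i < maxRetries; i++) {
            if (a.tryOnce()) {
                return true;
            }
            try {
                Thread.sleep(sleepMs); // back off between attempts
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false; // give up and let the caller report the failure
    }
}
```

With a cap like this, the clients described here would have failed fast instead of retrying once a second indefinitely.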

In the namenode log I eventually found repeated messages like:

2007-10-05 14:40:08,063 WARN org.apache.hadoop.fs.FSNamesystem: PendingReplicationMonitor
timed out block blk_3124504920241431462
2007-10-05 14:40:11,876 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer:
ask <IP4>:50010 to replicate blk_3124504920241431462 to datanode(s) <IP4_1>:50010
2007-10-05 14:45:08,069 WARN org.apache.hadoop.fs.FSNamesystem: PendingReplicationMonitor
timed out block blk_8533614499490422104
2007-10-05 14:45:08,070 WARN org.apache.hadoop.fs.FSNamesystem: PendingReplicationMonitor
timed out block blk_7741954594593177224
2007-10-05 14:45:13,973 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer:
ask <IP4>:50010 to replicate blk_7741954594593177224 to datanode(s) <IP4_2>:50010
2007-10-05 14:45:13,973 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer:
ask <IP4>:50010 to replicate blk_8533614499490422104 to datanode(s) <IP4_3>:50010

I could not ssh to the node with IP address <IP4>, but the datanode server seemingly
still sent heartbeats. After rebooting the node, it was okay again, and a few files and a few
clients recovered, but not all.
I restarted these clients and this time they completed (before noticing the marginal node
we had restarted the clients twice without success).

I would conclude that the marginal node must have caused loss of blocks,
at least in the tracking mechanism, in addition to the endless retries.

In summary, dfs should be able to handle datanodes that report a good heartbeat but otherwise
fail to do their job. This should include datanodes with a high rate of socket connection
timeouts.
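One way the namenode could detect such nodes is to track per-datanode transfer failure rates alongside heartbeats. A minimal sketch under that assumption (MarginalNodeTracker and its methods are hypothetical, not part of the Hadoop API):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hedged sketch: count transfer attempts and failures per datanode, so a
// node with a good heartbeat but a high socket-timeout rate can be skipped
// when choosing replication targets. All names here are illustrative.
public class MarginalNodeTracker {
    // node id -> {total attempts, failed attempts}
    private final Map<String, int[]> stats = new ConcurrentHashMap<>();
    private final double maxFailureRate;
    private final int minSamples;

    public MarginalNodeTracker(double maxFailureRate, int minSamples) {
        this.maxFailureRate = maxFailureRate;
        this.minSamples = minSamples;
    }

    // Record the outcome of one transfer attempt to a datanode.
    public void record(String node, boolean success) {
        int[] s = stats.computeIfAbsent(node, k -> new int[2]);
        synchronized (s) {
            s[0]++;                 // total attempts
            if (!success) s[1]++;   // failures, e.g. socket timeouts
        }
    }

    // A node counts as marginal once enough samples show a high failure
    // rate, regardless of whether its heartbeats still arrive.
    public boolean isMarginal(String node) {
        int[] s = stats.get(node);
        if (s == null) return false;
        synchronized (s) {
            return s[0] >= minSamples
                && (double) s[1] / s[0] > maxFailureRate;
        }
    }
}
```

With something like this, PendingReplicationMonitor could stop asking <IP4> to replicate blocks after repeated timeouts, instead of retrying the same unreachable node.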




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

