hadoop-hdfs-dev mailing list archives

From "Harsh J (Resolved) (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (HDFS-59) No recovery when trying to replicate on marginal datanode
Date Thu, 29 Dec 2011 12:35:31 GMT

     [ https://issues.apache.org/jira/browse/HDFS-59?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Harsh J resolved HDFS-59.
-------------------------

    Resolution: Not A Problem

This has gone stale and we haven't seen it lately. Let's file a new issue if we see this again
(these days it errors out with 'Could only replicate to X nodes' style errors).

Also, this could have been caused by having dfs.replication.min set > 1.
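
For context on that last point: dfs.replication.min controls how many replicas of a file's
last block must be reported before the namenode lets the client close the file. If it is set
above 1 and one of the chosen datanodes is flaky, close() keeps retrying and the client prints
the 'Could not complete file' messages quoted below. The following is only a minimal sketch of
checking the effective value, assuming Hadoop's Configuration API and an hdfs-site.xml on the
classpath (later releases rename the property to dfs.namenode.replication.min):

    import org.apache.hadoop.conf.Configuration;

    public class MinReplicationCheck {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Pick up the cluster's HDFS settings if hdfs-site.xml is on the classpath.
            conf.addResource("hdfs-site.xml");
            // Property name used in releases of this era; later renamed
            // dfs.namenode.replication.min.
            int minReplication = conf.getInt("dfs.replication.min", 1);
            System.out.println("dfs.replication.min = " + minReplication);
            // With a value > 1, close() blocks until that many replicas of the last
            // block are reported; on a flaky datanode this shows up as an endless
            // "Could not complete file, retrying..." loop on the client.
        }
    }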
                
> No recovery when trying to replicate on marginal datanode
> ---------------------------------------------------------
>
>                 Key: HDFS-59
>                 URL: https://issues.apache.org/jira/browse/HDFS-59
>             Project: Hadoop HDFS
>          Issue Type: Bug
>         Environment: Sep 14 nightly build with a couple of mapred-related patches
>            Reporter: Christian Kunz
>
> We have been uploading a lot of data to HDFS, running about 400 scripts in parallel that call
Hadoop's command-line utility in a distributed fashion. Many of them started to hang when copying
large files (>120GB), repeating the following messages without end:
> 07/10/05 15:44:25 INFO fs.DFSClient: Could not complete file, retrying...
> 07/10/05 15:44:26 INFO fs.DFSClient: Could not complete file, retrying...
> 07/10/05 15:44:26 INFO fs.DFSClient: Could not complete file, retrying...
> 07/10/05 15:44:27 INFO fs.DFSClient: Could not complete file, retrying...
> 07/10/05 15:44:27 INFO fs.DFSClient: Could not complete file, retrying...
> 07/10/05 15:44:28 INFO fs.DFSClient: Could not complete file, retrying...
> In the namenode log I eventually found repeated messages like:
> 2007-10-05 14:40:08,063 WARN org.apache.hadoop.fs.FSNamesystem: PendingReplicationMonitor
timed out block blk_3124504920241431462
> 2007-10-05 14:40:11,876 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer:
ask <IP4>:50010 to replicate blk_3124504920241431462 to datanode(s) <IP4_1>:50010
> 2007-10-05 14:45:08,069 WARN org.apache.hadoop.fs.FSNamesystem: PendingReplicationMonitor
timed out block blk_8533614499490422104
> 2007-10-05 14:45:08,070 WARN org.apache.hadoop.fs.FSNamesystem: PendingReplicationMonitor
timed out block blk_7741954594593177224
> 2007-10-05 14:45:13,973 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer:
ask <IP4>:50010 to replicate blk_7741954594593177224 to datanode(s) <IP4_2>:50010
> 2007-10-05 14:45:13,973 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer:
ask <IP4>:50010 to replicate blk_8533614499490422104 to datanode(s) <IP4_3>:50010
> I could not ssh to the node with IP address <IP4>, but seemingly the datanode server
still sent heartbeats. After rebooting the node it was okay again, and a few files and a few
clients recovered, but not all.
> I restarted these clients and they completed this time (before noticing the marginal
node, we had restarted the clients twice without success).
> I would conclude that the presence of the marginal node must have caused loss of blocks,
at least in the tracking mechanism, in addition to the endless retries.
> In summary, DFS should be able to handle datanodes that send healthy heartbeats but otherwise
fail to do their job, including datanodes that have a high rate of socket connection timeouts.
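
The namenode log above shows PendingReplicationMonitor repeatedly timing out blocks and then
re-asking the same unresponsive datanode to replicate them. Below is a hypothetical sketch, not
HDFS's actual implementation (all class and member names are made up), of the idea behind such a
monitor, extended with the failure counting the summary proposes so that a node with a healthy
heartbeat but stalled transfers would stop being chosen:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.List;
    import java.util.Map;

    // Hypothetical sketch: track when each replication request was sent, expire
    // the ones that stall, and count failures against the source datanode so a
    // "marginal" node is eventually excluded from replication work.
    class PendingReplicationTracker {
        private static final long TIMEOUT_MS = 5L * 60 * 1000; // roughly the 5-minute gaps in the log
        private static final int MAX_FAILURES = 3;             // hypothetical threshold

        // "blockId@datanode" -> time the transfer was requested
        private final Map<String, Long> pending = new HashMap<String, Long>();
        // datanode -> number of transfers that timed out on it
        private final Map<String, Integer> failures = new HashMap<String, Integer>();

        // Record that 'datanode' was asked to replicate 'blockId'.
        synchronized void transferStarted(String blockId, String datanode) {
            pending.put(blockId + "@" + datanode, Long.valueOf(System.currentTimeMillis()));
        }

        // Called periodically: return blocks whose transfer stalled so they can be
        // re-scheduled, and remember which datanode failed to carry them out.
        synchronized List<String> timedOutBlocks() {
            long now = System.currentTimeMillis();
            List<String> expired = new ArrayList<String>();
            Iterator<Map.Entry<String, Long>> it = pending.entrySet().iterator();
            while (it.hasNext()) {
                Map.Entry<String, Long> e = it.next();
                if (now - e.getValue().longValue() > TIMEOUT_MS) {
                    String[] parts = e.getKey().split("@", 2);
                    Integer prev = failures.get(parts[1]);
                    failures.put(parts[1], Integer.valueOf(prev == null ? 1 : prev.intValue() + 1));
                    expired.add(parts[0]);
                    it.remove();
                }
            }
            return expired;
        }

        // The report's proposal: a node whose heartbeat is fine but whose transfers
        // keep stalling should no longer be picked as a replication source or target.
        synchronized boolean isMarginal(String datanode) {
            Integer count = failures.get(datanode);
            return count != null && count.intValue() >= MAX_FAILURES;
        }
    }

If something like isMarginal() were consulted when choosing replication sources and targets, the
namenode would stop asking a node such as <IP4> to replicate blocks it keeps timing out on,
rather than retrying it indefinitely.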

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
