hadoop-common-dev mailing list archives

From "Christian Kunz (JIRA)" <j...@apache.org>
Subject [jira] Created: (HADOOP-2763) Replication Monitor timing out repeatedly
Date Fri, 01 Feb 2008 07:07:07 GMT
Replication Monitor timing out repeatedly
-----------------------------------------

                 Key: HADOOP-2763
                 URL: https://issues.apache.org/jira/browse/HADOOP-2763
             Project: Hadoop Core
          Issue Type: Bug
          Components: dfs
    Affects Versions: 0.16.0
         Environment: Jan 28 nightly build with patches 2095, 2119, and 2723
            Reporter: Christian Kunz


I upgraded a Hadoop installation to the Jan 28 nightly build.
The DFS contains more than 5 million files.

One hour after leaving safemode, fsck reported 5274 under-replicated blocks, 25 of them with only a single replica.
Three hours later it still reported 433 under-replicated blocks, 20 of them with only a single replica.

The namenode log shows repeated timeouts of the replication monitor for the same blocks:

2008-02-01 03:41:24,184 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer:
ask datanode to replicate blk_2984271423661664080 to datanode(s) datanode1 datanode2
2008-02-01 03:51:14,104 WARN org.apache.hadoop.fs.FSNamesystem: PendingReplicationMonitor
timed out block blk_2984271423661664080
2008-02-01 03:51:22,303 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer:
ask datanode to replicate blk_2984271423661664080 to datanode(s) datanode3 datanode4
2008-02-01 04:01:14,150 WARN org.apache.hadoop.fs.FSNamesystem: PendingReplicationMonitor
timed out block blk_2984271423661664080
2008-02-01 04:01:19,344 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer:
ask datanode to replicate blk_2984271423661664080 to datanode(s) datanode5 datanode6
...
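
For anyone who wants to check their own namenode log for the same pattern, here is a minimal sketch that counts PendingReplicationMonitor timeouts per block. It is not part of Hadoop. The regular expression is an assumption based only on the WARN lines quoted above, and it assumes each log entry sits on a single line in the real log file (the mail client wrapped them here). The function names are made up for illustration.

```python
import re
from collections import Counter

# Assumed pattern, derived from the WARN lines quoted in this report.
# Each log entry is assumed to occupy one line in the actual log file.
TIMEOUT_RE = re.compile(
    r"PendingReplicationMonitor\s+timed out block\s+(blk_-?\d+)"
)


def count_timeouts(lines):
    """Return a Counter mapping block ID to the number of timeouts seen."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def repeated_timeouts(lines, threshold=2):
    """Return (block ID, count) pairs for blocks that timed out at least
    `threshold` times, most frequent first."""
    return [
        (blk, n)
        for blk, n in count_timeouts(lines).most_common()
        if n >= threshold
    ]
```

Passing an open log file (any iterable of lines) to repeated_timeouts() would list the blocks that keep timing out, which is the symptom described above.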

The datanode seems to be successfully transmitting the blocks:

2008-02-01 03:42:06,284 INFO org.apache.hadoop.dfs.DataNode: datanode Starting thread to transfer
block blk_2984271423661664080 to datanode1, datanode2
2008-02-01 03:42:09,535 INFO org.apache.hadoop.dfs.DataNode: datanode:Transmitted block blk_2984271423661664080
to /datanode1

2008-02-01 03:52:06,238 INFO org.apache.hadoop.dfs.DataNode: datanode Starting thread to transfer
block blk_2984271423661664080 to datanode3,datanode4
2008-02-01 03:52:09,470 INFO org.apache.hadoop.dfs.DataNode: datanode:Transmitted block blk_2984271423661664080
to /datanode3


The destination datanodes seem to have problems receiving these blocks (this excerpt is from
a different, later attempt):

2008-02-01 06:43:06,541 INFO org.apache.hadoop.dfs.DataNode: Receiving block blk_2984271423661664080
from /datanode
2008-02-01 06:43:09,647 INFO org.apache.hadoop.dfs.DataNode: Exception in receiveBlock for
block blk_2984271423661664080 java.net.SocketException: Connection reset
2008-02-01 06:43:09,647 INFO org.apache.hadoop.dfs.DataNode: writeBlock blk_2984271423661664080
received exception java.net.SocketException: Connection reset
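
To see whether the receive failures are limited to a few blocks or a single exception type, the destination datanode logs can be summarized in the same way. This is a rough sketch under the same assumptions as before: the pattern is inferred only from the "Exception in receiveBlock" lines quoted above, entries are assumed to be one per line in the real log, and the function name is hypothetical.

```python
import re
from collections import Counter, defaultdict

# Assumed pattern, derived from the "Exception in receiveBlock" lines
# quoted in this report; captures block ID, exception class and message.
RECEIVE_ERR_RE = re.compile(
    r"Exception in receiveBlock for\s+block\s+(blk_-?\d+)\s+(\S+?):\s*(.*)$"
)


def receive_failures(lines):
    """Return {block ID: Counter of (exception class, message)} for every
    receiveBlock exception found in the given datanode log lines."""
    failures = defaultdict(Counter)
    for line in lines:
        match = RECEIVE_ERR_RE.search(line)
        if match:
            block, exc_class, message = match.groups()
            failures[block][(exc_class, message.strip())] += 1
    return dict(failures)
```

If every entry turns out to be java.net.SocketException with "Connection reset", that would point at the transfer path rather than at individual blocks, but the sketch only summarizes what the log says.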

However, I was able to transfer the block between the two datanodes successfully using scp.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

