hadoop-common-dev mailing list archives

From "Raghu Angadi (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-1955) Corrupted block replication retries for ever
Date Tue, 02 Oct 2007 21:28:51 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-1955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12531919
] 

Raghu Angadi commented on HADOOP-1955:
--------------------------------------

bq. In my case, this infinite loop was started when one datanode went down and the namenode
started replicating. Does this mean the namenode will keep on trying until someone accesses the
file and notices that it's corrupted?

Yes, if there is no valid replica. In your case it is not clear if all the replicas are corrupted.

With this patch, the Namenode will try all the remaining replicas as sources when replicating a block.
If none of these succeed (because all the replicas are corrupted), there is not much the
Namenode can do about it. It will just keep on trying (every 10 min) until eventually someone
notices the error.

In your case, if there is a good replica, it will be used in subsequent retries.
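
As a rough illustration of the retry behaviour described above, here is a minimal sketch. It is not the code in HADOOP-1955.patch: the class name ReplicationSourceChooser, the method nextSource, and the round-robin order are all assumptions made for the example. The only point it shows is that each retry of a block picks a different replica holder as the source, and that the cycle starts over once every holder has been tried.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only; names and round-robin policy are assumptions,
// not taken from the actual patch.
public class ReplicationSourceChooser {
    // Per block: index of the next replica holder to try as transfer source.
    private final Map<String, Integer> nextIndex = new HashMap<>();

    /**
     * Returns the datanode to use as the source for this attempt,
     * or null if no replica holders are known for the block.
     */
    public String nextSource(String blockId, List<String> replicaHolders) {
        if (replicaHolders.isEmpty()) {
            return null;
        }
        // Wrap around so retries continue after every holder has been tried.
        int i = nextIndex.getOrDefault(blockId, 0) % replicaHolders.size();
        nextIndex.put(blockId, i + 1);
        return replicaHolders.get(i);
    }

    public static void main(String[] args) {
        ReplicationSourceChooser chooser = new ReplicationSourceChooser();
        List<String> holders = new ArrayList<>();
        holders.add("dn-a");
        holders.add("dn-b");
        holders.add("dn-c");
        // Each loop iteration stands in for one timed-out replication attempt.
        for (int attempt = 1; attempt <= 4; attempt++) {
            System.out.println("attempt " + attempt + ": source "
                    + chooser.nextSource("blk_1", holders));
        }
    }
}
```

With a policy like this, a single corrupted source no longer blocks replication as long as at least one holder has a good copy. If every copy is corrupted, the attempts simply cycle.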



> Corrupted block replication retries for ever
> --------------------------------------------
>
>                 Key: HADOOP-1955
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1955
>             Project: Hadoop
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.14.1
>            Reporter: Koji Noguchi
>            Assignee: Raghu Angadi
>            Priority: Blocker
>             Fix For: 0.14.2
>
>         Attachments: HADOOP-1955.patch
>
>
> When replicating a corrupted block, the receiving side rejects the block due to a checksum error. The Namenode keeps on retrying (with the same source datanode).
> Fsck shows those blocks as under-replicated.
> [Namenode log]
> {noformat} 
> 2007-09-27 02:00:05,273 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.heartbeatCheck: lost heartbeat from 99.2.99.111
> ...
> 2007-09-27 02:01:02,618 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer: ask 99.9.99.11:9999 to replicate blk_-5925066143536023890 to datanode(s) 99.9.99.37:9999
> 2007-09-27 02:10:03,843 WARN org.apache.hadoop.fs.FSNamesystem: PendingReplicationMonitor timed out block blk_-5925066143536023890
> 2007-09-27 02:10:08,248 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer: ask 99.9.99.11:9999 to replicate blk_-5925066143536023890 to datanode(s) 99.9.99.35:9999
> 2007-09-27 02:20:03,848 WARN org.apache.hadoop.fs.FSNamesystem: PendingReplicationMonitor timed out block blk_-5925066143536023890
> 2007-09-27 02:20:08,646 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer: ask 99.9.99.11:9999 to replicate blk_-5925066143536023890 to datanode(s) 99.9.99.19:9999
> (repeats)
> {noformat} 
> [Datanode(sender) 99.9.99.11 log]
> {noformat} 
> 2007-09-27 02:01:04,493 INFO org.apache.hadoop.dfs.DataNode: Starting thread to transfer block blk_-5925066143536023890 to [Lorg.apache.hadoop.dfs.DatanodeInfo;@e58187
> 2007-09-27 02:01:05,153 WARN org.apache.hadoop.dfs.DataNode: Failed to transfer blk_-5925066143536023890 to 74.6.128.37:50010 got java.net.SocketException: Connection reset
>   at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
>   at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
>   at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>   at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
>   at java.io.DataOutputStream.write(DataOutputStream.java:90)
>   at org.apache.hadoop.dfs.DataNode.sendBlock(DataNode.java:1231)
>   at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:1280)
>   at java.lang.Thread.run(Thread.java:619)
> (repeats)
> {noformat} 
> [Datanode(one of the receiver) 99.9.99.37 log]
> {noformat} 
> 2007-09-27 02:01:05,150 ERROR org.apache.hadoop.dfs.DataNode: DataXceiver: java.io.IOException: Unexpected checksum mismatch while writing blk_-5925066143536023890 from /74.6.128.33:57605
>   at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:902)
>   at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:727)
>   at java.lang.Thread.run(Thread.java:619)
> {noformat} 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

