hadoop-hdfs-issues mailing list archives

From "Koji Noguchi (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HDFS-1371) One bad node can incorrectly flag many files as corrupt
Date Fri, 03 Sep 2010 16:25:35 GMT

    [ https://issues.apache.org/jira/browse/HDFS-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905952#action_12905952 ]

Koji Noguchi commented on HDFS-1371:
------------------------------------

(you guys are too fast.  I wanted the description to be short and was going to paste the logs
afterwards... )

Picking one such file: /myfile/part-00145.gz blk_-1426587446408804113_970819282

Namenode log showing:
{noformat}
2010-08-31 10:47:56,258 INFO org.apache.hadoop.hdfs.StateChange: BLOCK NameSystem.addToCorruptReplicasMap: blk_-1426587446408804113 added as corrupt on ZZ.YY.XX.220:1004 by /ZZ.YY.XX.246
2010-08-31 10:47:56,290 INFO org.apache.hadoop.hdfs.StateChange: BLOCK NameSystem.addToCorruptReplicasMap: blk_-1426587446408804113 added as corrupt on ZZ.YY.XX.252:1004 by /ZZ.YY.XX.246
2010-08-31 10:47:56,489 INFO org.apache.hadoop.hdfs.StateChange: BLOCK NameSystem.addToCorruptReplicasMap: blk_-1426587446408804113 added as corrupt on ZZ.YY.XX.107:1004 by /ZZ.YY.XX.246
2010-08-31 10:49:00,508 INFO org.apache.hadoop.hdfs.StateChange: BLOCK NameSystem.addToCorruptReplicasMap: duplicate requested for blk_-1426587446408804113 to add as corrupt on ZZ.YY.XX.252:1004 by /ZZ.YY.XX.246
2010-08-31 10:49:00,554 INFO org.apache.hadoop.hdfs.StateChange: BLOCK NameSystem.addToCorruptReplicasMap: duplicate requested for blk_-1426587446408804113 to add as corrupt on ZZ.YY.XX.107:1004 by /ZZ.YY.XX.246
2010-08-31 10:49:03,934 INFO org.apache.hadoop.hdfs.StateChange: BLOCK NameSystem.addToCorruptReplicasMap: duplicate requested for blk_-1426587446408804113 to add as corrupt on ZZ.YY.XX.220:1004 by /ZZ.YY.XX.246
2010-08-31 10:49:03,949 INFO org.apache.hadoop.hdfs.StateChange: BLOCK NameSystem.addToCorruptReplicasMap: duplicate requested for blk_-1426587446408804113 to add as corrupt on ZZ.YY.XX.252:1004 by /ZZ.YY.XX.246
2010-08-31 10:49:03,971 INFO org.apache.hadoop.hdfs.StateChange: BLOCK NameSystem.addToCorruptReplicasMap: duplicate requested for blk_-1426587446408804113 to add as corrupt on ZZ.YY.XX.107:1004 by /ZZ.YY.XX.246
2010-08-31 10:49:07,986 INFO org.apache.hadoop.hdfs.StateChange: BLOCK NameSystem.addToCorruptReplicasMap: duplicate requested for blk_-1426587446408804113 to add as corrupt on ZZ.YY.XX.252:1004 by /ZZ.YY.XX.246
2010-08-31 10:49:08,257 INFO org.apache.hadoop.hdfs.StateChange: BLOCK NameSystem.addToCorruptReplicasMap: duplicate requested for blk_-1426587446408804113 to add as corrupt on ZZ.YY.XX.220:1004 by /ZZ.YY.XX.246
2010-08-31 10:49:08,895 INFO org.apache.hadoop.hdfs.StateChange: BLOCK NameSystem.addToCorruptReplicasMap: duplicate requested for blk_-1426587446408804113 to add as corrupt on ZZ.YY.XX.107:1004 by /ZZ.YY.XX.246
{noformat}
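
As an aside, the "duplicate requested" lines at 10:49 suggest the namenode keys corruption reports by (block, datanode) pair and only dedups repeats. A minimal runnable sketch of that bookkeeping, loosely approximating 0.20's CorruptReplicasMap (types simplified to String, log text paraphrased; not a verbatim excerpt):

{noformat}
import java.util.*;

// Hypothetical sketch of the namenode-side bookkeeping, loosely
// approximating 0.20's CorruptReplicasMap; not a verbatim excerpt.
public class CorruptReplicasSketch {
  private final Map<String, Set<String>> corruptReplicas =
      new HashMap<String, Set<String>>();

  void addToCorruptReplicasMap(String block, String datanode) {
    Set<String> nodes = corruptReplicas.get(block);
    if (nodes == null) {
      nodes = new TreeSet<String>();
      corruptReplicas.put(block, nodes);
    }
    if (nodes.add(datanode)) {
      // first report for this (block, datanode) pair
      System.out.println(block + " added as corrupt on " + datanode);
    } else {
      // a retry from the same client lands here -- hence the
      // "duplicate requested" lines at 10:49 above
      System.out.println("duplicate requested for " + block
          + " to add as corrupt on " + datanode);
    }
  }

  public static void main(String[] args) {
    CorruptReplicasSketch m = new CorruptReplicasSketch();
    m.addToCorruptReplicasMap("blk_-1426587446408804113", "ZZ.YY.XX.220:1004");
    m.addToCorruptReplicasMap("blk_-1426587446408804113", "ZZ.YY.XX.220:1004");
  }
}
{noformat}

Note the map only dedups; nothing in it asks whether all the reports for a block came from the same client.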

User tasklogs on ZZ.YY.XX.246 showing:
{noformat}
[root@ZZ.YY.XX.246 ~]# find /my/mapred/userlogs/ -type f -exec grep 1426587446408804113 \{\} \; -print
org.apache.hadoop.fs.ChecksumException: Checksum error: /blk_-1426587446408804113:of:/myfile/part-00145.gz at 222720
2010-08-31 10:47:56,256 WARN org.apache.hadoop.hdfs.DFSClient: Found Checksum error for blk_-1426587446408804113_970819282 from ZZ.YY.XX.220:1004 at 222720
org.apache.hadoop.fs.ChecksumException: Checksum error: /blk_-1426587446408804113:of:/myfile/part-00145.gz at 103936
2010-08-31 10:47:56,284 WARN org.apache.hadoop.hdfs.DFSClient: Found Checksum error for blk_-1426587446408804113_970819282 from ZZ.YY.XX.252:1004 at 103936
org.apache.hadoop.fs.ChecksumException: Checksum error: /blk_-1426587446408804113:of:/myfile/part-00145.gz at 250368
2010-08-31 10:47:56,464 WARN org.apache.hadoop.hdfs.DFSClient: Found Checksum error for blk_-1426587446408804113_970819282 from ZZ.YY.XX.107:1004 at 250368
2010-08-31 10:47:56,490 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain block blk_-1426587446408804113_970819282 from any node: java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry...
{noformat}

This was consistent across all 12 files reported as corrupt: in every case the corruption reports came from the same node, ZZ.YY.XX.246.
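
The tasklog sequence above matches the client failing over replica by replica, reporting each one to the namenode, then refetching block locations and retrying (which produces the duplicate reports at 10:49). A rough self-contained sketch of that loop, approximating 0.20's DFSInputStream behavior (plain Java, names illustrative; not a verbatim excerpt):

{noformat}
import java.util.*;

// Hypothetical sketch of the client-side failover seen in the
// tasklog (approximating 0.20's DFSInputStream; not verbatim).
public class ReadFailoverSketch {
  // Stands in for a read that verifies CRCs. Here it fails for every
  // replica -- as it would if the fault (bad NIC, bad RAM, ...) were
  // on the reading host rather than on any datanode.
  static boolean readVerified(String datanode) { return false; }

  public static void main(String[] args) {
    List<String> replicas = Arrays.asList(
        "ZZ.YY.XX.220:1004", "ZZ.YY.XX.252:1004", "ZZ.YY.XX.107:1004");
    Set<String> deadNodes = new HashSet<String>();
    for (String dn : replicas) {
      if (readVerified(dn)) {
        return;                        // clean read: nothing reported
      }
      // On ChecksumException the real client warns and reports this
      // single replica to the namenode, then moves to the next one.
      System.out.println("Found Checksum error from " + dn
          + " -> report replica as corrupt");
      deadNodes.add(dn);
    }
    // All replicas exhausted: refetch locations and retry, which
    // re-reports the same replicas ("duplicate requested" at 10:49).
    System.out.println("Could not obtain block from any node. Will get"
        + " new block locations from namenode and retry...");
  }
}
{noformat}

Because the client alone decides a replica is corrupt, one sick host can walk through all three replicas and condemn every one of them.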


When I tried to pull this file from another, healthy node, to my surprise it didn't fail.

{noformat}
[knoguchi@gwgd4003 ~]$ hadoop dfs -ls /myfile/part-00145.gz
Found 1 items
-rw-r--r--   3 user1 users   67771377 2010-08-31 06:46 /myfile/part-00145.gz

[knoguchi@gwgd4003 ~]$ hadoop fsck /myfile/part-00145.gz
.
/myfile/part-00145.gz: CORRUPT block blk_-1426587446408804113
Status: CORRUPT
 Total size:    67771377 B
 Total dirs:    0
 Total files:   1
 Total blocks (validated):      1 (avg. block size 67771377 B)
  ********************************
  CORRUPT FILES:        1
  CORRUPT BLOCKS:       1
  ********************************
 Minimally replicated blocks:   1 (100.0 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       0 (0.0 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    3
 Average block replication:     3.0
 Corrupt blocks:                1
 Missing replicas:              0 (0.0 %)


The filesystem under path '/myfile/part-00145.gz' is CORRUPT
[knoguchi@gwgd4003 ~]$
[knoguchi@gwgd4003 ~]$ hadoop dfs -get /myfile/part-00145.gz /tmp
[knoguchi@gwgd4003 ~]$ echo $?
0
[knoguchi@gwgd4003 ~]$ ls -l /tmp/part-00145.gz
-rw-r--r-- 1 knoguchi users 67771377 Sep  2 21:04 /tmp/part-00145.gz
[knoguchi@gwgd4003 ~]$
{noformat}
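
This is consistent with fsck answering purely from the namenode's corrupt-replica map, while -get re-reads the data and verifies checksums end to end; since the replicas are actually healthy, the read succeeds from a healthy host. A quick way to double-check replica health from another client (hypothetical snippet, not from the issue):

{noformat}
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical verification snippet: reading the whole file through
// the normal client path verifies the CRC of every chunk, so a clean
// pass means the replicas really are intact.
public class VerifyRead {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // e.g. VerifyRead /myfile/part-00145.gz
    InputStream in = fs.open(new Path(args[0]));
    try {
      byte[] buf = new byte[64 * 1024];
      long total = 0;
      for (int n; (n = in.read(buf)) > 0; ) {
        total += n;                 // throws ChecksumException if corrupt
      }
      System.out.println("read " + total + " bytes, no checksum errors");
    } finally {
      in.close();
    }
  }
}
{noformat}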




> One bad node can incorrectly flag many files as corrupt
> -------------------------------------------------------
>
>                 Key: HDFS-1371
>                 URL: https://issues.apache.org/jira/browse/HDFS-1371
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs client, name-node
>    Affects Versions: 0.20.1
>         Environment: yahoo internal version 
> [knoguchi@gwgd4003 ~]$ hadoop version
> Hadoop 0.20.104.3.1007030707
>            Reporter: Koji Noguchi
>
> On our cluster, 12 files were reported as corrupt by fsck even though the replicas on the datanodes were healthy.
> Turns out that all the replicas (12 files x 3 replicas per file) were reported corrupt from one node.
> Surprisingly, these files were still readable/accessible from dfsclient (-get/-cat) without any problems.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

