hadoop-common-dev mailing list archives

From "Christian Kunz (JIRA)" <j...@apache.org>
Subject [jira] Created: (HADOOP-3392) Corrupted blocks leading to job failures
Date Wed, 14 May 2008 21:19:55 GMT
Corrupted blocks leading to job failures
----------------------------------------

                 Key: HADOOP-3392
                 URL: https://issues.apache.org/jira/browse/HADOOP-3392
             Project: Hadoop Core
          Issue Type: Improvement
    Affects Versions: 0.16.0
            Reporter: Christian Kunz


On one of our clusters we ended up with 11 singly-replicated corrupted blocks (checksum errors), and jobs were failing because no live replica of those blocks was available.

fsck reports the system as healthy, although it is not.

I argue that fsck should have an option to verify that the remaining replicas of under-replicated blocks are actually intact.
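
As a stop-gap, the affected blocks can probably be spotted with the existing tooling; a minimal sketch (assuming the usual "repl=N" notation in the -blocks output, and "/" as the path to scan):

  # list every file with its blocks, replication count, and datanode locations
  bin/hadoop fsck / -files -blocks -locations

  # pick out blocks that have only a single replica left
  bin/hadoop fsck / -files -blocks -locations | grep ' repl=1 '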

Even better, the namenode should automatically check under-replicated blocks with repeated replication failures for corruption and list them somewhere on the web UI. And there should be an option to override the corruption, i.e. accept the block data as-is and recompute its checksums.
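
Until something like that exists, one way to verify a suspect file is to force a full read through the DFS client, which checks the checksums and surfaces corruption as a read failure; for example (path is illustrative):

  bin/hadoop dfs -cat /path/to/suspect/file > /dev/null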

Question: Is it at all probable that two or more replicas of a block have checksum errors? If not, then we could restrict the checking to singly-replicated blocks.
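
For a rough sense of the numbers (purely illustrative): if each replica independently had, say, a 1 in 10^4 chance of developing a checksum error over some period, a singly-replicated block would be bad with probability 10^-4, a doubly-replicated one with probability 10^-8, and a triply-replicated one with probability 10^-12. Under that kind of independence assumption, restricting the check to singly-replicated (or at most doubly-replicated) blocks would cover nearly all of the practical risk.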

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

