Mailing-List: contact core-dev-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: core-dev@hadoop.apache.org
Message-ID: <304191745.1210799995578.JavaMail.jira@brutus>
Date: Wed, 14 May 2008 14:19:55 -0700 (PDT)
From: "Christian Kunz (JIRA)" <jira@apache.org>
To: core-dev@hadoop.apache.org
Subject: [jira] Created: (HADOOP-3392) Corrupted blocks leading to job
 failures
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

Corrupted blocks leading to job failures
----------------------------------------

                 Key: HADOOP-3392
                 URL: https://issues.apache.org/jira/browse/HADOOP-3392
             Project: Hadoop Core
          Issue Type: Improvement
    Affects Versions: 0.16.0
            Reporter: Christian Kunz


On one of our clusters we ended up with 11 singly-replicated corrupted blocks (checksum errors), and jobs failed because no live replica of those blocks was available.

fsck reports the system as healthy, although it is not.

I argue that fsck should have an option to verify that under-replicated blocks are not corrupt.
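
To illustrate what such a check involves, here is a minimal sketch of per-chunk checksum verification. It is not the HDFS implementation: the function names are hypothetical, and the 512-byte chunk size and CRC32 are assumptions chosen to resemble a typical per-chunk checksum configuration.

```python
import zlib
from typing import List, Optional

BYTES_PER_CHECKSUM = 512  # assumed chunk size, for illustration only


def chunk_checksums(data: bytes, chunk_size: int = BYTES_PER_CHECKSUM) -> List[int]:
    """Compute one CRC32 per fixed-size chunk of the block data."""
    return [
        zlib.crc32(data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    ]


def first_corrupt_chunk(
    data: bytes,
    stored: List[int],
    chunk_size: int = BYTES_PER_CHECKSUM,
) -> Optional[int]:
    """Return the index of the first chunk whose checksum does not match
    the stored value, or None if every chunk matches."""
    actual = chunk_checksums(data, chunk_size)
    for idx, (a, s) in enumerate(zip(actual, stored)):
        if a != s:
            return idx
    if len(actual) != len(stored):
        # Data and checksum list disagree on the number of chunks.
        return min(len(actual), len(stored))
    return None
```

A check like this has to read the whole replica, so limiting it to under-replicated blocks, as proposed above, keeps the cost bounded.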

Even better, the namenode should automatically check under-replicated blocks with repeated replication failures for corruption and list them somewhere in the GUI. There should also be an option to undo the corruption and recompute the checksums.

Question: Is it at all likely that two or more replicas of a block have checksum errors? If not, we could restrict the check to singly-replicated blocks.
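
A rough way to reason about this question is sketched below. It assumes each replica is corrupted independently with the same probability p; both the independence and the value of p are assumptions, not measurements from any cluster.

```python
def p_all_replicas_corrupt(p: float, replicas: int) -> float:
    """Probability that every replica of one block is corrupt,
    assuming independent corruption with per-replica probability p."""
    return p ** replicas


def expected_unreadable_blocks(n_blocks: int, p: float, replicas: int) -> float:
    """Expected number of blocks with no good replica left,
    under the same independence assumption."""
    return n_blocks * p_all_replicas_corrupt(p, replicas)


if __name__ == "__main__":
    p = 1e-3  # hypothetical per-replica corruption probability
    for r in (1, 2, 3):
        print(r, p_all_replicas_corrupt(p, r))
```

Under these assumptions, each additional replica multiplies the probability by p, so blocks with two or more replicas are far less likely to be completely corrupt than singly-replicated ones. If corruption is correlated, for example when a corrupt replica is itself the source of a copy, this estimate would be too optimistic.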
