Date: Thu, 15 Jan 2015 10:20:34 +0000 (UTC)
From: "Jens Rabe (JIRA)"
To: mapreduce-dev@hadoop.apache.org
Reply-To: mapreduce-dev@hadoop.apache.org
Subject: [jira] [Created] (MAPREDUCE-6216) Seeking backwards in MapFiles does not always correctly sync the underlying SequenceFile, resulting in "File is corrupt" exceptions

Jens Rabe created MAPREDUCE-6216:
------------------------------------

             Summary: Seeking backwards in MapFiles does not always correctly sync the underlying SequenceFile, resulting in "File is corrupt" exceptions
                 Key: MAPREDUCE-6216
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6216
             Project: Hadoop Map/Reduce
          Issue Type: Bug
    Affects Versions: 2.4.1
            Reporter: Jens Rabe
            Priority: Critical

On some occasions, when reading MapFiles that were generated by MapFileOutputFormat with BZip2 BLOCK compression, calling getClosest(key, value, true) on the MapFile reader throws an IOException with the message "File is corrupt!".

"hdfs fsck" reports that everything is OK, and the underlying data and index files can also be read correctly with a SequenceFile.Reader. The exception is thrown in the readBlock() method of the SequenceFile.Reader class.

My guess is that, because MapFile.Reader's seekInternal() method calls seek() instead of sync(), the offsets stored in the index file must point to positions that lie on sync boundaries. When the exception occurs, the position the cursor is moved to is not such a valid position. So I think the culprit is the generation of the index files when MapFiles are written.
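For illustration, a minimal sketch of the read path that triggers the exception; the directory path and the LongWritable/BytesWritable key/value types are placeholders, not the types used by the actual job:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapFile;

public class GetClosestRepro {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // Directory written by a job using MapFileOutputFormat with BLOCK
    // compression and BZip2Codec; the path is a placeholder.
    Path mapFileDir = new Path("/data/mapfile-output/part-r-00000");

    try (MapFile.Reader reader = new MapFile.Reader(mapFileDir, conf)) {
      LongWritable key = new LongWritable(12345L);
      BytesWritable value = new BytesWritable();
      // before=true asks for the closest entry at or before the key;
      // this is the call that sporadically fails with "File is corrupt!".
      LongWritable found = (LongWritable) reader.getClosest(key, value, true);
      System.out.println("closest key at or before " + key + ": " + found);
    }
  }
}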
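And a small sketch of the seek()/sync() difference on the underlying SequenceFile.Reader that this guess is based on; the file path and offset are made up:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;

public class SeekVsSync {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    Path dataFile = new Path("/data/mapfile-output/part-r-00000/data");
    long indexedPosition = 123456L; // an offset as it would come from the MapFile index

    try (SequenceFile.Reader reader =
             new SequenceFile.Reader(conf, SequenceFile.Reader.file(dataFile))) {
      // seek() trusts the caller: it positions the reader exactly at the given
      // offset, which must be a valid record/block boundary. If it is not,
      // the next read of a BLOCK-compressed file can fail in readBlock()
      // with "File is corrupt!".
      reader.seek(indexedPosition);

      // sync() instead scans forward from the offset to the next sync marker,
      // so the reader always ends up on a safe boundary regardless of the
      // offset it was given.
      reader.sync(indexedPosition);
    }
  }
}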