Date: Thu, 15 Jan 2015 10:45:35 +0000 (UTC)
From: "Jens Rabe (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Reply-To: mapreduce-issues@hadoop.apache.org
Subject: [jira] [Updated] (MAPREDUCE-6216) Seeking backwards in MapFiles does not always correctly sync the underlying SequenceFile, resulting in "File is corrupt" exceptions

     [ https://issues.apache.org/jira/browse/MAPREDUCE-6216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jens Rabe updated MAPREDUCE-6216:
---------------------------------
    Description:
On some occasions, when reading MapFiles which were generated by MapFileOutputFormat with BZIP2 BLOCK compression, using getClosest(key, value, true) on the MapFile reader causes an IOException to be thrown with the message "File is corrupt!"
Running "hdfs fsck" shows that everything is OK, and the underlying data and index files can also be read correctly if read with a SequenceFile.Reader.
The exception happens in the readBlock() method of the SequenceFile.Reader class.
My guess is that, since MapFile.Reader's seekInternal() method does "seek()" instead of "sync()", nothing checks whether the cursor is really positioned at a valid location.

  was:
On some occasions, when reading MapFiles which were generated by MapFileOutputFormat with BZIP2 BLOCK compression, using getClosest(key, value, true) on the MapFile reader causes an IOException to be thrown with the message "File is corrupt!"
Running "hdfs fsck" shows that everything is OK, and the underlying data and index files can also be read correctly if read with a SequenceFile.Reader.
The exception happens in the readBlock() method of the SequenceFile.Reader class.
My guess is that, since MapFile.Reader's seekInternal() method does "seek()" instead of "sync()", the indices in the index file must point to "synced" positions. When the exception occurs, the position the cursor is moved to is not valid. So I think the culprit is the generation of the index files when MapFiles are output.

> Seeking backwards in MapFiles does not always correctly sync the underlying SequenceFile, resulting in "File is corrupt" exceptions
> -----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6216
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6216
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 2.4.1
>            Reporter: Jens Rabe
>            Priority: Critical
>              Labels: mapfile, sequencefile
>
> On some occasions, when reading MapFiles which were generated by MapFileOutputFormat with BZIP2 BLOCK compression, using getClosest(key, value, true) on the MapFile reader causes an IOException to be thrown with the message "File is corrupt!"
> Running "hdfs fsck" shows that everything is OK, and the underlying data and index files can also be read correctly if read with a SequenceFile.Reader.
> The exception happens in the readBlock() method of the SequenceFile.Reader class.
> My guess is that, since MapFile.Reader's seekInternal() method does "seek()" instead of "sync()", nothing checks whether the cursor is really positioned at a valid location.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
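
To make the seek()-versus-sync() guess in the description concrete, here is a minimal, self-contained toy model. It does not use Hadoop at all: the class SyncVsSeek, the 4-byte marker, the block layout and the method names are hypothetical stand-ins invented for illustration, not the real SequenceFile format or API. It only sketches the general idea that a reader which jumps to an arbitrary offset and expects a sync marker there will report corruption, while a reader that scans forward to the next marker will not.

```java
import java.util.Arrays;

class SyncVsSeek {
    // Hypothetical 4-byte marker; a stand-in for a per-file sync marker.
    static final byte[] SYNC = {(byte) 0xFF, (byte) 0xFE, (byte) 0xFD, (byte) 0xFC};

    /** Builds a toy file: each block is SYNC, a one-byte length, then the payload. */
    static byte[] buildFile(byte[][] payloads) {
        int size = 0;
        for (byte[] p : payloads) {
            size += SYNC.length + 1 + p.length;
        }
        byte[] out = new byte[size];
        int pos = 0;
        for (byte[] p : payloads) {
            System.arraycopy(SYNC, 0, out, pos, SYNC.length);
            pos += SYNC.length;
            out[pos++] = (byte) p.length;
            System.arraycopy(p, 0, out, pos, p.length);
            pos += p.length;
        }
        return out;
    }

    /** Returns true if a complete marker starts exactly at pos. */
    static boolean isBlockStart(byte[] file, int pos) {
        if (pos < 0 || pos + SYNC.length > file.length) {
            return false;
        }
        for (int i = 0; i < SYNC.length; i++) {
            if (file[pos + i] != SYNC[i]) {
                return false;
            }
        }
        return true;
    }

    /** "seek" style: trusts the offset and fails if no marker is found there. */
    static byte[] readBlockAt(byte[] file, int pos) {
        if (!isBlockStart(file, pos)) {
            throw new IllegalStateException("File is corrupt! (no marker at offset " + pos + ")");
        }
        int start = pos + SYNC.length + 1;
        int len = file[pos + SYNC.length] & 0xFF;
        return Arrays.copyOfRange(file, start, start + len);
    }

    /** "sync" style: scans forward from pos to the next marker, or file.length if none. */
    static int syncForward(byte[] file, int pos) {
        for (int p = Math.max(pos, 0); p + SYNC.length <= file.length; p++) {
            if (isBlockStart(file, p)) {
                return p;
            }
        }
        return file.length;
    }

    public static void main(String[] args) {
        byte[] file = buildFile(new byte[][] {"alpha".getBytes(), "beta".getBytes()});
        int stale = 3; // an offset that does not fall on a block boundary
        try {
            readBlockAt(file, stale);
        } catch (IllegalStateException e) {
            System.out.println("seek-style read failed: " + e.getMessage());
        }
        int synced = syncForward(file, stale);
        System.out.println("sync-style read found: " + new String(readBlockAt(file, synced)));
    }
}
```

In this toy, the file itself is intact and can be read front to back without error, which mirrors the observation that fsck and a plain sequential read both succeed. Whether the real failure comes from the seek in MapFile.Reader or from the offsets written to the index file is exactly the open question in this issue; the sketch does not settle it.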