Date: Tue, 30 Dec 2014 19:17:14 +0000 (UTC)
From: "Ankit Kamboj (JIRA)"
To: common-issues@hadoop.apache.org
Reply-To: common-issues@hadoop.apache.org
Subject: [jira] [Commented] (HADOOP-11445) Bzip2Codec: Data block is skipped when position of newly created stream is equal to start of split

    [ https://issues.apache.org/jira/browse/HADOOP-11445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14261378#comment-14261378 ]

Ankit Kamboj commented on HADOOP-11445:
---------------------------------------

Sorry, I misread the results after the patch was applied. It looks like the overall -1 is due to the -1 for findbugs. But these warnings don't seem to be generated by the code that this patch touches. Could one of the committers please take a look and advise?

> Bzip2Codec: Data block is skipped when position of newly created stream is equal to start of split
> --------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-11445
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11445
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 2.4.0
>            Reporter: Ankit Kamboj
>         Attachments: HADOOP-11445.001.patch
>
>
> bz2 input files are handled by FileInputFormat+LineRecordReader. In LineRecordReader, a bz2-specific compressed input stream is created to iterate over records. Each newly created stream points to the beginning of the next data block. The logic that finds the beginning of the next block depends on the start of the split: the search begins 10 bytes before the start of the split. If the first search creates an input stream whose position is before or at the start of the split, the beginning of the next block is sought (assuming that the record reader for the previous split has already iterated over the data block in which the current start of split lies). If the split start is exactly at the byte where a newly created stream is positioned (the start of a data block), an attempt is still made to find the beginning of the next data block. This doesn't seem correct, because it skips a whole block and results in missing records.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
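
The boundary condition described in the issue can be illustrated with a small stand-alone model. This is only a sketch, not Hadoop code: the class `SplitBlockSketch`, its methods, and the block offsets are hypothetical, and it assumes that the reader for the previous split only consumes blocks that start before its own end offset. The `strictStart` variant shows one possible alternative comparison and is not necessarily what the attached patch does.

```java
// Hypothetical model of the block search described in HADOOP-11445.
// None of these names exist in Hadoop; offsets are made-up byte positions.
class SplitBlockSketch {

    // Lookback distance taken from the issue text ("10 bytes behind the start of split").
    static final long LOOKBACK = 10;

    // Returns the first block start at or after `from`, or -1 if there is none.
    // `blockStarts` must be sorted in ascending order.
    static long firstBlockAtOrAfter(long[] blockStarts, long from) {
        for (long b : blockStarts) {
            if (b >= from) {
                return b;
            }
        }
        return -1;
    }

    // Behaviour as reported: if the stream lands before OR AT the split start,
    // the next block is sought. Models non-first splits only.
    static long reportedStart(long[] blockStarts, long splitStart) {
        long pos = firstBlockAtOrAfter(blockStarts, Math.max(0, splitStart - LOOKBACK));
        if (pos != -1 && pos <= splitStart) {
            pos = firstBlockAtOrAfter(blockStarts, pos + 1);
        }
        return pos;
    }

    // Assumed alternative: advance only when the stream lands strictly before
    // the split start, so a block starting exactly at the split start is kept.
    static long strictStart(long[] blockStarts, long splitStart) {
        long pos = firstBlockAtOrAfter(blockStarts, Math.max(0, splitStart - LOOKBACK));
        if (pos != -1 && pos < splitStart) {
            pos = firstBlockAtOrAfter(blockStarts, pos + 1);
        }
        return pos;
    }

    public static void main(String[] args) {
        long[] blockStarts = {0, 1000, 2000};
        long splitStart = 1000; // split boundary coincides with a block start
        System.out.println("reported: " + reportedStart(blockStarts, splitStart));
        System.out.println("strict:   " + strictStart(blockStarts, splitStart));
    }
}
```

With block starts at 0, 1000 and 2000 and a split starting at 1000, the reported logic begins reading at 2000, so the block at 1000 is read by neither split under the stated assumption, while the strict comparison begins at 1000. For a split starting at 1005, both variants begin at 2000, which matches the intended behaviour of leaving the block at 1000 to the previous split.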