Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 0BFAC200D27 for ; Wed, 11 Oct 2017 00:07:08 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id 0AA9D160BE3; Tue, 10 Oct 2017 22:07:08 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id 50A2A160BE0 for ; Wed, 11 Oct 2017 00:07:07 +0200 (CEST) Received: (qmail 71790 invoked by uid 500); 10 Oct 2017 22:07:06 -0000 Mailing-List: contact common-issues-help@hadoop.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Delivered-To: mailing list common-issues@hadoop.apache.org Received: (qmail 71779 invoked by uid 99); 10 Oct 2017 22:07:06 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd1-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 10 Oct 2017 22:07:06 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd1-us-west.apache.org (ASF Mail Server at spamd1-us-west.apache.org) with ESMTP id 8315CD9D3E for ; Tue, 10 Oct 2017 22:07:05 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd1-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: -99.202 X-Spam-Level: X-Spam-Status: No, score=-99.202 tagged_above=-999 required=6.31 tests=[KAM_ASCII_DIVIDERS=0.8, RP_MATCHES_RCVD=-0.001, SPF_PASS=-0.001, USER_IN_WHITELIST=-100] autolearn=disabled Received: from mx1-lw-eu.apache.org ([10.40.0.8]) by localhost (spamd1-us-west.apache.org [10.40.0.7]) (amavisd-new, port 10024) with ESMTP id i4ZH1id8T3gW for ; Tue, 10 Oct 2017 22:07:04 +0000 (UTC) Received: from mailrelay1-us-west.apache.org (mailrelay1-us-west.apache.org [209.188.14.139]) by mx1-lw-eu.apache.org (ASF Mail Server at mx1-lw-eu.apache.org) with ESMTP id EBFD85F397 for ; Tue, 10 Oct 2017 22:07:03 +0000 (UTC) Received: from jira-lw-us.apache.org (unknown [207.244.88.139]) by mailrelay1-us-west.apache.org (ASF Mail Server at mailrelay1-us-west.apache.org) with ESMTP id 9D5B1E0EAA for ; Tue, 10 Oct 2017 22:07:01 +0000 (UTC) Received: from jira-lw-us.apache.org (localhost [127.0.0.1]) by jira-lw-us.apache.org (ASF Mail Server at jira-lw-us.apache.org) with ESMTP id A825A25386 for ; Tue, 10 Oct 2017 22:07:00 +0000 (UTC) Date: Tue, 10 Oct 2017 22:07:00 +0000 (UTC) From: "Subru Krishnan (JIRA)" To: common-issues@hadoop.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (HADOOP-14919) BZip2 drops records when reading data in splits MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 archived-at: Tue, 10 Oct 2017 22:07:08 -0000 [ https://issues.apache.org/jira/browse/HADOOP-14919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16199444#comment-16199444 ] Subru Krishnan commented on HADOOP-14919: ----------------------------------------- Thanks [~jlowe] for fixing this and [~tanakahda] for validating the fix. I assume since it has been tested, it'll be committed in time for 2.9? > BZip2 drops records when reading data in splits > ----------------------------------------------- > > Key: HADOOP-14919 > URL: https://issues.apache.org/jira/browse/HADOOP-14919 > Project: Hadoop Common > Issue Type: Bug > Affects Versions: 2.8.0, 2.7.3, 3.0.0-alpha1 > Reporter: Aki Tanaka > Assignee: Jason Lowe > Priority: Critical > Attachments: 250000.bz2, HADOOP-14919-test.patch, HADOOP-14919.001.patch > > > BZip2 can drop records when reading data in splits. This problem was already discussed before in HADOOP-11445 and HADOOP-13270. But we still have a problem in corner case, causing lost data blocks. > > I attached a unit test for this issue. You can reproduce the problem if you run the unit test. > > First, this issue happens when position of newly created stream is equal to start of split. Hadoop has some test cases for this (blockEndingInCR.txt.bz2 file for TestLineRecordReader#testBzip2SplitStartAtBlockMarker, etc). However, the issue I am reporting does not happen when we run these tests because this issue happens only when the start of split byte block includes both block marker and compressed data. > > BZip2 block marker - 0x314159265359 (001100010100000101011001001001100101001101011001) > > blockEndingInCR.txt.bz2 (Start of Split - 136504): > {code:java} > $ xxd -l 6 -g 1 -b -seek 136498 ./hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/target/test-classes/blockEndingInCR.txt.bz2 > 0021532: 00110001 01000001 01011001 00100110 01010011 01011001 1AY&SY > {code} > > Test bz2 File (Start of Split - 203426) > {code:java} > $ xxd -l 7 -g 1 -b -seek 203419 250000.bz2 > 0031a9b: 11100110 00101000 00101011 00100100 11001010 01101011 .(+$.k > 0031aa1: 00101111 / > {code} > > Let's say a job splits this test bz2 file into two splits at the start of split (position 203426). > The former split does not read records which start position 203426 because BZip2 says the position of these dropped records is 203427. The latter split does not read the records because BZip2CompressionInputStream read the block from position 320955. > Due to this behavior, records between 203427 and 320955 are lost. > Also, if we reverted the changes in HADOOP-13270, we will not see this issue. We will see HADOOP-13270 issue though. -- This message was sent by Atlassian JIRA (v6.4.14#64029) --------------------------------------------------------------------- To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org For additional commands, e-mail: common-issues-help@hadoop.apache.org