hadoop-common-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Hadoop QA (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-14919) BZip2 drops records when reading data in splits
Date Thu, 05 Oct 2017 01:58:00 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-14919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16192365#comment-16192365
] 

Hadoop QA commented on HADOOP-14919:
------------------------------------

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 16s{color} | {color:blue}
Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} |
{color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m  0s{color}
| {color:green} The patch appears to include 1 new or modified test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 15s{color} | {color:blue}
Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 13m 29s{color}
| {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 15m 58s{color} |
{color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  2m 15s{color}
| {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 37s{color} |
{color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 15m 49s{color}
| {color:green} branch has no errors when building and testing our client artifacts. {color}
|
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 26s{color} |
{color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 15s{color} |
{color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 21s{color} | {color:blue}
Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 13s{color}
| {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 13m  6s{color} |
{color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 13m  6s{color} | {color:green}
the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  2m 19s{color}
| {color:orange} root: The patch generated 2 new + 118 unchanged - 9 fixed = 120 total (was
127) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 54s{color} |
{color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color}
| {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 57s{color}
| {color:green} patch has no errors when building and testing our client artifacts. {color}
|
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 37s{color} |
{color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 40s{color} |
{color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}  9m 22s{color} | {color:red}
hadoop-common in the patch failed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}101m 37s{color} | {color:green}
hadoop-mapreduce-client-jobclient in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 45s{color}
| {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}216m 49s{color} | {color:black}
{color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.security.TestKDiag |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:71bbb86 |
| JIRA Issue | HADOOP-14919 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12890447/HADOOP-14919.001.patch
|
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  unit  shadedclient
 findbugs  checkstyle  |
| uname | Linux 081a67763b93 3.13.0-117-generic #164-Ubuntu SMP Fri Apr 7 11:05:26 UTC 2017
x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / cae1c73 |
| Default Java | 1.8.0_144 |
| findbugs | v3.1.0-RC1 |
| checkstyle | https://builds.apache.org/job/PreCommit-HADOOP-Build/13454/artifact/patchprocess/diff-checkstyle-root.txt
|
| unit | https://builds.apache.org/job/PreCommit-HADOOP-Build/13454/artifact/patchprocess/patch-unit-hadoop-common-project_hadoop-common.txt
|
|  Test Results | https://builds.apache.org/job/PreCommit-HADOOP-Build/13454/testReport/ |
| modules | C: hadoop-common-project/hadoop-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient
U: . |
| Console output | https://builds.apache.org/job/PreCommit-HADOOP-Build/13454/console |
| Powered by | Apache Yetus 0.6.0-SNAPSHOT   http://yetus.apache.org |


This message was automatically generated.



> BZip2 drops records when reading data in splits
> -----------------------------------------------
>
>                 Key: HADOOP-14919
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14919
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 2.8.0, 2.7.3, 3.0.0-alpha1
>            Reporter: Aki Tanaka
>            Assignee: Jason Lowe
>            Priority: Critical
>         Attachments: 250000.bz2, HADOOP-14919.001.patch, HADOOP-14919-test.patch
>
>
> BZip2 can drop records when reading data in splits. This problem was already discussed
before in HADOOP-11445 and HADOOP-13270. But we still have a problem in corner case, causing
lost data blocks.
>  
> I attached a unit test for this issue. You can reproduce the problem if you run the unit
test.
>  
> First, this issue happens when position of newly created stream is equal to start of
split. Hadoop has some test cases for this (blockEndingInCR.txt.bz2 file for TestLineRecordReader#testBzip2SplitStartAtBlockMarker,
etc). However, the issue I am reporting does not happen when we run these tests because this
issue happens only when the start of split byte block includes both block marker and compressed
data.
>  
> BZip2 block marker - 0x314159265359 (001100010100000101011001001001100101001101011001)
>  
> blockEndingInCR.txt.bz2 (Start of Split - 136504):
> {code:java}
> $ xxd -l 6 -g 1 -b -seek 136498 ./hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/target/test-classes/blockEndingInCR.txt.bz2
> 0021532: 00110001 01000001 01011001 00100110 01010011 01011001  1AY&SY
> {code}
>  
> Test bz2 File (Start of Split - 203426)
> {code:java}
> $ xxd -l 7 -g 1 -b -seek 203419 250000.bz2
> 0031a9b: 11100110 00101000 00101011 00100100 11001010 01101011  .(+$.k
> 0031aa1: 00101111                                               /
> {code}
>  
> Let's say a job splits this test bz2 file into two splits at the start of split (position
203426).
> The former split does not read records which start position 203426 because BZip2 says
the position of these dropped records is 203427. The latter split does not read the records
because BZip2CompressionInputStream read the block from position 320955.
> Due to this behavior, records between 203427 and 320955 are lost.
> Also, if we reverted the changes in HADOOP-13270, we will not see this issue. We will
see HADOOP-13270 issue though.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org


Mime
View raw message