hadoop-common-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-15206) BZip2 drops and duplicates records when input split size is small
Date Thu, 15 Feb 2018 17:13:00 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-15206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365954#comment-16365954
] 

Jason Lowe commented on HADOOP-15206:
-------------------------------------

bq. Deleted comments in the code

Sorry, I didn't mean the entire comment needed to be deleted.  I think the comments are very
helpful in explaining why the logic is there; I just didn't see the need to call out the
specific JIRA number, since that is trivially obtained from git.

Speaking of comments, I noticed that this one is slightly incorrect and should be adjusted when
the comments are reinstated:
{code}
        // HADOOP-15206: When we're in BYBLOCK mode and the start position
        // is >=0 and < HEADER_LEN + SUB_HEADER_LEN, we should also skip
        // to right after the BZip2 header to avoid duplicated records
        skipPos = HEADER_LEN + SUB_HEADER_LEN + 1 - this.startingPos;
{code}
"Skip to right after the BZip2 header" may lead someone to think there's an off-by-one bug
in the code.  We need to skip to right after the start of the first bz2 block (which occurs
right after the bz2 header).
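
For example, the comment could read something like this (just a suggested wording, same logic):
{code}
        // When we're in BYBLOCK mode and the start position is >= 0 and
        // < HEADER_LEN + SUB_HEADER_LEN, skip to just past the start of the
        // first bz2 block (which begins right after the BZip2 header) to
        // avoid duplicated records
        skipPos = HEADER_LEN + SUB_HEADER_LEN + 1 - this.startingPos;
{code}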

Nit: skipPos is not really a position but rather the number of bytes being skipped, so it
looks incorrect when the code calls updateReportedByteCount on what appears to be a position
rather than a byte delta.  Something like numSkipped or numBytesSkipped would be a less confusing
name.
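
For instance, the rename could look something like this (only a sketch; the surrounding code
and the exact declaration stay whatever the patch already has):
{code}
        // a count of bytes to skip, not an absolute position
        numBytesSkipped = HEADER_LEN + SUB_HEADER_LEN + 1 - this.startingPos;
        ...
        updateReportedByteCount(numBytesSkipped);
{code}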

It would be nice to fix the checkstyle warning about line length on the comment.

The unit test failures appear to be unrelated, and they pass for me locally with the patch
applied.

> BZip2 drops and duplicates records when input split size is small
> -----------------------------------------------------------------
>
>                 Key: HADOOP-15206
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15206
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 2.8.3, 3.0.0
>            Reporter: Aki Tanaka
>            Assignee: Aki Tanaka
>            Priority: Major
>         Attachments: HADOOP-15206-test.patch, HADOOP-15206.001.patch, HADOOP-15206.002.patch,
HADOOP-15206.003.patch, HADOOP-15206.004.patch, HADOOP-15206.005.patch, HADOOP-15206.006.patch,
HADOOP-15206.007.patch
>
>
> BZip2 can drop and duplicate records when the input split size is small. I confirmed that
this issue happens when the input split size is between 1 byte and 4 bytes.
> I am seeing the following two problem behaviors.
>  
> 1. Drop record:
> BZip2 skips the first record in the input file when the input split size is small
>  
> I set the split size to 3 and tested loading 100 records (0, 1, 2, ..., 99):
> {code:java}
> 2018-02-01 10:52:33,502 INFO  [Thread-17] mapred.TestTextInputFormat (TestTextInputFormat.java:verifyPartitions(317))
- splits[1]=file:/work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+3
count=99{code}
> > The input format read only 99 records, not 100
>  
> 2. Duplicate Record:
> Two input splits have the same BZip2 records when the input split size is small
>  
> I set the split size to 1 and tested loading 100 records (0, 1, 2, ..., 99):
>  
> {code:java}
> 2018-02-01 11:18:49,309 INFO [Thread-17] mapred.TestTextInputFormat (TestTextInputFormat.java:verifyPartitions(318))
- splits[3]=file /work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+1
count=99
> 2018-02-01 11:18:49,310 WARN [Thread-17] mapred.TestTextInputFormat (TestTextInputFormat.java:verifyPartitions(308))
- conflict with 1 in split 4 at position 8
> {code}
>  
> I experienced this error when I executed a Spark (SparkSQL) job under the following conditions:
> * The input files are small (around 1 KB)
> * The Hadoop cluster has many slave nodes (able to launch many executor tasks)
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

