hadoop-common-issues mailing list archives

From "Suresh Srinivas (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-9442) Splitting issue when using NLineInputFormat with compression
Date Fri, 29 Mar 2013 13:23:15 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-9442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13617327#comment-13617327 ]

Suresh Srinivas commented on HADOOP-9442:
-----------------------------------------

bq. It could be a bug, for Hadoop not splitting compressed data correctly using NLineInputFormat.

The description of the jira made it sound like you were asking a question. Many such
jiras are created in Hadoop where jira is misused for asking questions. Perhaps this could
be a bug, so reopening is the right thing to do. I will ask someone with more MapReduce
background to comment on this.

I am also moving this jira to MapReduce.
                
> Splitting issue when using NLineInputFormat with compression
> ------------------------------------------------------------
>
>                 Key: HADOOP-9442
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9442
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 1.1.2
>         Environment: Tried on Apache Hadoop 1.1.1, CDH4, and Amazon EMR; same result.
>            Reporter: Qiming He
>            Priority: Minor
>
> # Make one long text line. Only long-line text seems to cause the issue.
> $ cat abook.txt | base64 -w 0 > onelinetext.b64  # 200KB+ long
> $ hadoop fs -put onelinetext.b64 /input/onelinetext.b64
> $ hadoop jar hadoop-streaming.jar \
>     -input /input/onelinetext.b64 \
>     -output /output \
>     -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
>     -mapper wc
> Num task: 1, and output has one line:
> Line 1: 1 2 202699
> which makes sense because one line per mapper is intended.
> Then, using compression with NLineInputFormat:
> $ gzip onelinetext.b64
> $ hadoop fs -put onelinetext.b64.gz /input/onelinetext.b64.gz
> $ hadoop jar hadoop-streaming.jar \
>       -Dmapred.input.compress=true \
>       -Dmapred.input.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
>       -input /input/onelinetext.b64.gz \
>       -output /output \
>       -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
>       -mapper wc
> I am expecting the same results as above, because decompression should occur before
> the one-line text is processed (i.e., by wc). However, I am getting:
> Num task: 397 (or another large number, depending on the environment), and output has 397 lines:
> Lines 1-396: 0 0 0
> Line 397: 1 2 202699
> Any idea why there are so many map tasks (mapred.map.tasks >> 1)? Is it incorrect
> splitting? I purposely chose gzip because I believe it is NOT splittable. I got similar
> results when using the bzip2 and lzop codecs.
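
One way to probe the question in the description, offered only as a sketch: if line
boundaries were being counted on the still-compressed bytes rather than on the
decompressed text, every 0x0A byte in the .gz file would look like an end of line. This
is an assumption, not something confirmed in this thread. The commands below use a
hypothetical sample file built from random data (not the reporter's abook.txt) and
compare the two counts locally, without Hadoop.

```shell
# Hypothetical local check; file names here are illustrative, not from the report.
workdir=$(mktemp -d)
sample="$workdir/sample.b64"

# Build one long base64 line (about 200 KB), then terminate it with a newline.
head -c 150000 /dev/urandom | base64 | tr -d '\n' > "$sample"
echo >> "$sample"

# Line count as a reader would see it after decompression.
plain_lines=$(gzip -c "$sample" | gzip -dc | wc -l | tr -d ' ')

# Count of newline bytes (0x0A) in the still-compressed stream.
raw_newlines=$(gzip -c "$sample" | wc -l | tr -d ' ')

echo "lines after decompression:          $plain_lines"
echo "newline bytes in compressed stream: $raw_newlines"
```

If the second number is far larger than the first, a task count in the hundreds would be
consistent with splits being computed on compressed bytes. That would still need to be
checked against the NLineInputFormat source for the affected version.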

