pig-dev mailing list archives

From "Eugene Koontz (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PIG-3208) [zebra] TFile should not set io.compression.codec.lzo.buffersize
Date Tue, 19 Mar 2013 20:37:15 GMT

    [ https://issues.apache.org/jira/browse/PIG-3208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13606756#comment-13606756 ]

Eugene Koontz commented on PIG-3208:
------------------------------------

Hi Xuefu, Daniel and Dmitriy,

Thanks for considering this patch!

Xuefu, yes, Zebra will take its default from the cluster. If you don't set it, the default is determined by the version of hadoop-lzo that you use; for example, with current trunk it will be 256 * 1024. Please see:

https://github.com/twitter/hadoop-lzo/blob/master/src/java/com/hadoop/compression/lzo/LzoCodec.java#L51

See also the discussion between @kevinweil and @tlipcon here: 

https://github.com/twitter/hadoop-lzo/commit/3afba3037be4152fcc088ddb13f2ec12e659ed9c
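
For reference, here is a minimal sketch of the lookup pattern involved; the key name and the 256 * 1024 fallback match hadoop-lzo trunk, but the class and method names here are hypothetical:

{code}
// Minimal sketch, not hadoop-lzo source: it shows only how the buffer
// size is resolved from the job/cluster configuration.
import org.apache.hadoop.conf.Configuration;

public class LzoBufferSizeSketch {
    // Key and fallback as in hadoop-lzo trunk (LzoCodec.java).
    static final String LZO_BUFFER_SIZE_KEY = "io.compression.codec.lzo.buffersize";
    static final int LZO_BUFFER_SIZE_DEFAULT = 256 * 1024;

    // The configured value wins; the default applies only when the key is unset.
    static int resolveBufferSize(Configuration conf) {
        return conf.getInt(LZO_BUFFER_SIZE_KEY, LZO_BUFFER_SIZE_DEFAULT);
    }
}
{code}

Zebra's hardcoded conf.setInt(..., 64 * 1024) overrides whatever value the cluster chose, which is exactly the mismatch described in the issue below.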

-Eugene
                
> [zebra] TFile should not set io.compression.codec.lzo.buffersize
> ----------------------------------------------------------------
>
>                 Key: PIG-3208
>                 URL: https://issues.apache.org/jira/browse/PIG-3208
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Eugene Koontz
>            Assignee: Eugene Koontz
>         Attachments: PIG-3208.patch
>
>
> In contrib/zebra/src/java/org/apache/hadoop/zebra/tfile/Compression.java, the following occurs:
> {code}
> conf.setInt("io.compression.codec.lzo.buffersize", 64 * 1024);
> {code}
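> To see why forcing this value is risky, here is a hypothetical, self-contained illustration (invented for clarity; not Hadoop code) of the invariant a block decompressor relies on: every compressed block must fit in the reader's buffer, so a reader forced down to 64 * 1024 can fail on blocks written against a larger buffer:
> {code}
> // Hypothetical illustration only; names and logic are invented.
> public class BufferMismatchDemo {
>     static void readBlock(int compressedBlockSize, int readerBufferSize) {
>         if (compressedBlockSize > readerBufferSize) {
>             // Analogous to lzo1x_decompress returning -4 (LZO_E_INPUT_OVERRUN).
>             throw new InternalError("input overrun: " + compressedBlockSize
>                     + " bytes do not fit in a " + readerBufferSize + "-byte buffer");
>         }
>     }
>
>     public static void main(String[] args) {
>         readBlock(200 * 1024, 256 * 1024); // writer and reader agree: OK
>         readBlock(200 * 1024, 64 * 1024);  // reader limited to 64 KiB: overrun
>     }
> }
> {code}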
> The hardcoded 64 * 1024 setting can cause the LZO decompressor, when invoked while reading TFiles, to return an error code if the data's compressed size is too large to fit in 64 * 1024 bytes.
> By contrast, the Hadoop-LZO code uses a larger default value (256 * 1024):
> https://github.com/twitter/hadoop-lzo/blob/master/src/java/com/hadoop/compression/lzo/LzoCodec.java#L185
> This can lead to the following failure: if data is compressed on a cluster where the default {{io.compression.codec.lzo.buffersize}} = 256 * 1024 is in effect, and that data is then read through Pig's zebra, the Mapper exits with code 134 because the LZO decompressor returns -4 (which encodes the LZO C library error LZO_E_INPUT_OVERRUN) when trying to uncompress the data. The stack trace of such a case is shown below:
> {code}
> 2013-02-17 14:47:50,709 INFO com.hadoop.compression.lzo.LzoCodec: Creating stream for compressor: com.hadoop.compression.lzo.LzoCompressor@6818c458 with bufferSize: 262144
> 2013-02-17 14:47:50,849 INFO org.apache.hadoop.io.compress.CodecPool: Paying back codec: com.hadoop.compression.lzo.LzoCompressor@6818c458
> 2013-02-17 14:47:50,849 INFO org.apache.hadoop.mapred.MapTask: Finished spill 3
> 2013-02-17 14:47:50,857 INFO org.apache.hadoop.io.compress.CodecPool: Borrowing codec: com.hadoop.compression.lzo.LzoCompressor@6818c458
> 2013-02-17 14:47:50,866 INFO com.hadoop.compression.lzo.LzoCodec: Creating stream for compressor: com.hadoop.compression.lzo.LzoCompressor@6818c458 with bufferSize: 262144
> 2013-02-17 14:47:50,879 INFO org.apache.hadoop.io.compress.CodecPool: Paying back codec: com.hadoop.compression.lzo.LzoCompressor@6818c458
> 2013-02-17 14:47:50,879 INFO org.apache.hadoop.mapred.MapTask: Finished spill 4
> 2013-02-17 14:47:50,887 INFO org.apache.hadoop.mapred.Merger: Merging 5 sorted segments
> 2013-02-17 14:47:50,890 INFO org.apache.hadoop.io.compress.CodecPool: Borrowing codec: com.hadoop.compression.lzo.LzoDecompressor@66a23610
> 2013-02-17 14:47:50,891 INFO com.hadoop.compression.lzo.LzoDecompressor: calling decompressBytesDirect with buffer with: position: 0 and limit: 262144
> 2013-02-17 14:47:50,891 INFO com.hadoop.compression.lzo.LzoDecompressor: read: 245688 bytes from decompressor.
> 2013-02-17 14:47:50,891 INFO org.apache.hadoop.io.compress.CodecPool: Borrowing codec: com.hadoop.compression.lzo.LzoDecompressor@43684706
> 2013-02-17 14:47:50,892 INFO com.hadoop.compression.lzo.LzoDecompressor: calling decompressBytesDirect with buffer with: position: 0 and limit: 65536
> 2013-02-17 14:47:50,895 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
> 2013-02-17 14:47:50,897 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.InternalError: lzo1x_decompress returned: -4
>         at com.hadoop.compression.lzo.LzoDecompressor.decompressBytesDirect(Native Method)
>         at com.hadoop.compression.lzo.LzoDecompressor.decompress(LzoDecompressor.java:307)
>         at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:82)
>         at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:75)
>         at org.apache.hadoop.mapred.IFile$Reader.readData(IFile.java:341)
>         at org.apache.hadoop.mapred.IFile$Reader.rejigData(IFile.java:371)
>         at org.apache.hadoop.mapred.IFile$Reader.readNextBlock(IFile.java:355)
>         at org.apache.hadoop.mapred.IFile$Reader.next(IFile.java:387)
>         at org.apache.hadoop.mapred.Merger$Segment.next(Merger.java:220)
>         at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:420)
>         at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:381)
>         at org.apache.hadoop.mapred.Merger.merge(Merger.java:77)
>         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1548)
>         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1180)
>         at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:582)
>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:649)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
>         at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:396)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1213)
>         at org.apache.hadoop.mapred.Child.main(Child.java:264)
> {code}
> (Some additional LOG.info statements, not included in this patch, were added to produce the above output.)
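> As a hypothetical sketch of the direction the title suggests (the actual change is in the attached PIG-3208.patch): instead of forcing 64 * 1024, Compression.java could honor a value already present in the configuration and supply a fallback only when the key is unset, for example:
> {code}
> // Hypothetical sketch, not the attached patch: honor the cluster setting.
> int bufferSize = conf.getInt("io.compression.codec.lzo.buffersize", 256 * 1024);
> {code}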

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
