hadoop-common-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-12990) lz4 incompatibility between OS and Hadoop
Date Thu, 19 Jan 2017 20:58:26 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-12990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15830596#comment-15830596
] 

Jason Lowe commented on HADOOP-12990:
-------------------------------------

bq. My proposed approach is to create a new file based loosely on hadoop/io/compress/Lz4Codec.java
reproducing the byte structure analagous to your 4/3 hack. Does that seem reasonable? 

The problem with that approach is the existing codec is using one-shot decode.  If the amount
of data to decode exceeds the buffer size being used by the decoder it just blows up.  So
it's not as simple as tacking on a header and passing it through to the existing code.  That
will work with small amounts of data but not once it exceeds the decoder buffer size.

To handle arbitrary files generated by the LZ4 CLI tool it really needs to use the streaming
API so the buffer sizes used by the encoder and decoder are decoupled, and the decoder can
handle arbitrarily large amounts of input without needing to chunk it into one-shot blocks
as the current one does.


> lz4 incompatibility between OS and Hadoop
> -----------------------------------------
>
>                 Key: HADOOP-12990
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12990
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: io, native
>    Affects Versions: 2.6.0
>            Reporter: John Zhuge
>            Priority: Minor
>
> {{hdfs dfs -text}} hit exception when trying to view the compression file created by
Linux lz4 tool.
> The Hadoop version has HADOOP-11184 "update lz4 to r123", thus it is using LZ4 library
in release r123.
> Linux lz4 version:
> {code}
> $ /tmp/lz4 -h 2>&1 | head -1
> *** LZ4 Compression CLI 64-bits r123, by Yann Collet (Apr  1 2016) ***
> {code}
> Test steps:
> {code}
> $ cat 10rows.txt
> 001|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 002|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 003|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 004|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 005|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 006|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 007|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 008|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 009|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 010|c1|c2|c3|c4|c5|c6|c7|c8|c9
> $ /tmp/lz4 10rows.txt 10rows.txt.r123.lz4
> Compressed 310 bytes into 105 bytes ==> 33.87%
> $ hdfs dfs -put 10rows.txt.r123.lz4 /tmp
> $ hdfs dfs -text /tmp/10rows.txt.r123.lz4
> 16/04/01 08:19:07 INFO compress.CodecPool: Got brand-new decompressor [.lz4]
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>     at org.apache.hadoop.io.compress.BlockDecompressorStream.getCompressedData(BlockDecompressorStream.java:123)
>     at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:98)
>     at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
>     at java.io.InputStream.read(InputStream.java:101)
>     at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:85)
>     at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:59)
>     at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:119)
>     at org.apache.hadoop.fs.shell.Display$Cat.printToStdout(Display.java:106)
>     at org.apache.hadoop.fs.shell.Display$Cat.processPath(Display.java:101)
>     at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:317)
>     at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:289)
>     at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:271)
>     at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:255)
>     at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:118)
>     at org.apache.hadoop.fs.shell.Command.run(Command.java:165)
>     at org.apache.hadoop.fs.FsShell.run(FsShell.java:315)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
>     at org.apache.hadoop.fs.FsShell.main(FsShell.java:372)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org


Mime
View raw message