hadoop-common-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-12990) lz4 incompatibility between OS and Hadoop
Date Tue, 18 Apr 2017 20:43:41 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-12990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15973452#comment-15973452

Jason Lowe commented on HADOOP-12990:

My apologies for the delay, somehow I missed getting notified on your latest comment.

If this is really going to work as end-users expect then it needs to use the '.lz4' extension.
 That means going with the approach of a unified Lz4Codec that can read both the legacy Lz4Codec
and write the CLI-compatible one.  There would need to be a release note explaining the incompatibility
for data written by the newer codec for clusters still running the older codec.  There are
some other caveats, since this could cause issues for a rolling downgrade (i.e.: data written
by the new codec before the downgrade can't be decoded after the downgrade).  We can mitigate
this by making the output format of the new codec configurable and setting the default to
be the legacy format, but then of course it doesn't work "out of the box" with the lz4 CLI
tool which will be surprising to some.

Anyway the first step is to see if the first proposal is even feasible -- can the codec reliably
auto-detect which format is being used and properly decode both the legacy and CLI-compatible
formats.  If that possibility exists then we can work through the logistics of whether the
new codec emits the CLI-compatible format by default and how to handle the compatibility scenarios.

> lz4 incompatibility between OS and Hadoop
> -----------------------------------------
>                 Key: HADOOP-12990
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12990
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: io, native
>    Affects Versions: 2.6.0
>            Reporter: John Zhuge
>            Priority: Minor
> {{hdfs dfs -text}} hit exception when trying to view the compression file created by
Linux lz4 tool.
> The Hadoop version has HADOOP-11184 "update lz4 to r123", thus it is using LZ4 library
in release r123.
> Linux lz4 version:
> {code}
> $ /tmp/lz4 -h 2>&1 | head -1
> *** LZ4 Compression CLI 64-bits r123, by Yann Collet (Apr  1 2016) ***
> {code}
> Test steps:
> {code}
> $ cat 10rows.txt
> 001|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 002|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 003|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 004|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 005|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 006|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 007|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 008|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 009|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 010|c1|c2|c3|c4|c5|c6|c7|c8|c9
> $ /tmp/lz4 10rows.txt 10rows.txt.r123.lz4
> Compressed 310 bytes into 105 bytes ==> 33.87%
> $ hdfs dfs -put 10rows.txt.r123.lz4 /tmp
> $ hdfs dfs -text /tmp/10rows.txt.r123.lz4
> 16/04/01 08:19:07 INFO compress.CodecPool: Got brand-new decompressor [.lz4]
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>     at org.apache.hadoop.io.compress.BlockDecompressorStream.getCompressedData(BlockDecompressorStream.java:123)
>     at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:98)
>     at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
>     at java.io.InputStream.read(InputStream.java:101)
>     at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:85)
>     at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:59)
>     at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:119)
>     at org.apache.hadoop.fs.shell.Display$Cat.printToStdout(Display.java:106)
>     at org.apache.hadoop.fs.shell.Display$Cat.processPath(Display.java:101)
>     at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:317)
>     at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:289)
>     at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:271)
>     at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:255)
>     at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:118)
>     at org.apache.hadoop.fs.shell.Command.run(Command.java:165)
>     at org.apache.hadoop.fs.FsShell.run(FsShell.java:315)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
>     at org.apache.hadoop.fs.FsShell.main(FsShell.java:372)
> {code}

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org

View raw message