From: Dmitriy Ryaboy
To: common-user@hadoop.apache.org
Date: Thu, 1 Apr 2010 00:16:16 -0700
Subject: Errors reading lzo-compressed files from Hadoop

Hi folks,

We write a lot of lzo-compressed files to HDFS -- some via scribe, some using internal tools. Occasionally, we discover that the created lzo files cannot be read from HDFS -- they get through some (often large) portion of the file, and then fail with the following stack trace:

Exception in thread "main" java.lang.InternalError: lzo1x_decompress_safe returned:
        at com.hadoop.compression.lzo.LzoDecompressor.decompressBytesDirect(Native Method)
        at com.hadoop.compression.lzo.LzoDecompressor.decompress(LzoDecompressor.java:303)
        at com.hadoop.compression.lzo.LzopDecompressor.decompress(LzopDecompressor.java:122)
        at com.hadoop.compression.lzo.LzopInputStream.decompress(LzopInputStream.java:223)
        at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:74)
        at java.io.InputStream.read(InputStream.java:85)
        at com.twitter.twadoop.jobs.LzoReadTest.main(LzoReadTest.java:51)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

The initial thought is of course that the lzo file is corrupt -- however, plain-jane lzop is able to read these files. Moreover, if we pull the files out of hadoop, uncompress them, compress them again, and put them back into HDFS, we can usually read them from HDFS as well.
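For context, the read path that dies is just the lzo codec wrapped around the stream returned by FileSystem.open(). A trimmed-down sketch -- not our actual LzoReadTest, and it assumes com.hadoop.compression.lzo.LzopCodec is registered via io.compression.codecs and the native lzo library is loadable -- looks like this:

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class LzoReadSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path(args[0]);               // e.g. an .lzo file in HDFS (hypothetical path)
    FileSystem fs = path.getFileSystem(conf);

    // Resolves LzopCodec from the .lzo suffix of the path.
    CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(path);

    // Wraps the raw stream in the codec's decompressing stream (the LzopInputStream in the trace).
    InputStream in = codec.createInputStream(fs.open(path));
    byte[] buf = new byte[64 * 1024];
    long total = 0;
    for (int n = in.read(buf); n != -1; n = in.read(buf)) {
      total += n;                                // the InternalError surfaces in this loop, part way through the file
    }
    in.close();
    System.out.println("read " + total + " uncompressed bytes");
  }
}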
We've been thinking that this strange behavior is caused by a bug in the hadoop-lzo libraries (we use the version with the Twitter and Cloudera fixes, on github: http://github.com/kevinweil/hadoop-lzo ). However, today I discovered that, using the exact same environment, codec, and InputStreams, we can successfully read from the local file system but cannot read from HDFS. This appears to point at possible issues in FSDataInputStream or further down the stack.

Here's a small test class that tries to read the same file from HDFS and from the local FS, along with the output of running it on our cluster. We are using the CDH2 distribution.

https://gist.github.com/e1bf7e4327c7aef56303

Any ideas on what could be going on?

Thanks,
-Dmitriy
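P.S. In case the gist is unreachable: the test is shaped roughly like the following (a paraphrase, not the exact gist code). It drains the identical .lzo file twice with the same codec and read loop, once through the local filesystem and once through HDFS; on the affected files only the HDFS pass hits the InternalError.

import java.io.IOException;
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class LzoLocalVsHdfs {
  // Drains one .lzo file through the codec and returns the uncompressed byte count.
  static long drain(FileSystem fs, Path p, Configuration conf) throws IOException {
    CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(p);
    InputStream in = codec.createInputStream(fs.open(p));
    byte[] buf = new byte[64 * 1024];
    long total = 0;
    for (int n = in.read(buf); n != -1; n = in.read(buf)) {
      total += n;
    }
    in.close();
    return total;
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path localCopy = new Path(args[0]);   // copy of the file on local disk (hypothetical path)
    Path hdfsCopy = new Path(args[1]);    // the same file as it sits in HDFS

    System.out.println("local fs: " + drain(FileSystem.getLocal(conf), localCopy, conf) + " bytes");
    // On the bad files, this second read is the one that dies with
    // "java.lang.InternalError: lzo1x_decompress_safe returned:".
    System.out.println("hdfs:     " + drain(FileSystem.get(conf), hdfsCopy, conf) + " bytes");
  }
}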