hadoop-common-user mailing list archives

From "Alex Feinberg" <a...@socialmedia.com>
Subject Re: Data corruption when using Lzo Codec
Date Fri, 19 Sep 2008 15:46:10 GMT
Hi Chris,

I was also unable to decompress the output after running a map/reduce job
with "cat" as the mapper and then fetching the result with dfs -get.

I will try using LzopCodec.
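
For reference, a sketch of the only part of the streaming command that would change. It assumes LzopCodec sits in the same org.apache.hadoop.io.compress package as LzoCodec and is present in the installed release; CODEC_OPTS is a hypothetical variable name used only for illustration.

```shell
# Codec-related streaming options with LzopCodec swapped in for LzoCodec.
# Assumption: the class is available on the cluster under this package name.
CODEC_OPTS="-jobconf mapred.output.compress=true \
-jobconf mapred.output.compression.codec=org.apache.hadoop.io.compress.LzopCodec"

# The rest of the original command line (jar, -inputformat, -mapper, -input,
# -output) stays unchanged; these options replace the two codec -jobconf flags.
echo "$CODEC_OPTS"
```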

- Alex

On Fri, Sep 19, 2008 at 2:34 AM, Chris Douglas <chrisdo@yahoo-inc.com> wrote:
> It's probably not corrupted. If by "compressed lzo file" you mean something
> readable with lzop, you should use LzopCodec, not LzoCodec. LzoCodec doesn't
> write header information required by that tool.
> Guessing at the output format (length-encoded blocks of data compressed by
> the lzo algorithm), it's probably readable by TextInputFormat, but YMMV. If
> you want to use the C tool, you'll have to add the appropriate header (see
> the lzop source or LzopCodec) using a hex editor and append four zero bytes
> to the end of the file. You can also use lzo compression in SequenceFiles. -C
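
The header-and-trailer step described above could also be scripted rather than done by hand in a hex editor. This is a minimal sketch, assuming the header bytes have already been prepared in a separate file (taken from the lzop source or LzopCodec; they are not reproduced here). The function name wrap_lzop and its arguments are hypothetical, and whether lzop accepts the result depends on the LzoCodec block layout matching what lzop expects, which is not verified here.

```shell
# wrap_lzop HEADER_FILE BODY_FILE OUT_FILE
# Writes HEADER_FILE followed by BODY_FILE to OUT_FILE, then appends
# four zero bytes as the end-of-stream marker mentioned in the reply.
wrap_lzop() {
    header=$1
    body=$2
    out=$3
    # Concatenate the prepared header and the LzoCodec output.
    cat "$header" "$body" > "$out" || return 1
    # Append four NUL bytes (octal escapes are portable across printf versions).
    printf '\000\000\000\000' >> "$out"
}
```

It would be called on a file fetched with dfs -get, for example wrap_lzop header.bin part-00000 part-00000.lzo, where all three file names are placeholders.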
> On Sep 18, 2008, at 9:15 PM, Alex Feinberg wrote:
>> Hello,
>> I am running a custom crawler (written internally) using hadoop streaming.
>> I am attempting to compress the output using LZO, but instead I am
>> receiving corrupted output that is neither in the format I am aiming for
>> nor a compressed lzo file. Is this a known issue? Is there anything I am
>> doing inherently wrong?
>> Here is the command line I am using:
>> ~/hadoop/bin/hadoop jar
>> /home/hadoop/hadoop/contrib/streaming/hadoop-
>> -inputformat org.apache.hadoop.mapred.SequenceFileAsTextInputFormat
>> -mapper /home/hadoop/crawl_map -reducer NONE -jobconf
>> mapred.output.compress=true -jobconf
>> mapred.output.compression.codec=org.apache.hadoop.io.compress.LzoCodec
>> -input pages -output crawl.lzo -jobconf mapred.reduce.tasks=0
>> The input is in the form of URLs stored as a SequenceFile.
>> When running this without LZO compression, no such issue occurs.
>> Is there any way for me to recover the corrupted data so that I can
>> process it with other hadoop jobs or offline?
>> Thanks,
>> --
>> Alex Feinberg
>> Platform Engineer, SocialMedia Networks

Alex Feinberg
Platform Engineer, SocialMedia Networks
