hadoop-common-user mailing list archives

From Chris Douglas <chri...@yahoo-inc.com>
Subject Re: Data corruption when using Lzo Codec
Date Tue, 23 Sep 2008 01:09:43 GMT
If you're using TextInputFormat, you need to add LzoCodec to the list  
of codecs in the io.compression.codecs property.
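For example, in hadoop-site.xml (assuming the stock DefaultCodec/GzipCodec
entries; append LzoCodec to whatever your config already lists):

  <property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.LzoCodec</value>
  </property>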

LzopCodec is only for reading/writing files produced/consumed by the C  
tool; it's not in 0.17. The ".lzo" files produced in 0.17 are not  
"real" .lzo files, but that's how you can get the codec to recognize  
them in this version. In the future, you might want to just use the  
lzo codec with SequenceFileOutputFormat (use BLOCK compression). -C
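P.S. An untested sketch of that, adapting your original streaming command
(the output name crawl.seq is just a placeholder):

  ~/hadoop/bin/hadoop jar \
    /home/hadoop/hadoop/contrib/streaming/hadoop-0.17.2.1-streaming.jar \
    -inputformat org.apache.hadoop.mapred.SequenceFileAsTextInputFormat \
    -outputformat org.apache.hadoop.mapred.SequenceFileOutputFormat \
    -mapper /home/hadoop/crawl_map -reducer NONE \
    -jobconf mapred.output.compress=true \
    -jobconf mapred.output.compression.type=BLOCK \
    -jobconf mapred.output.compression.codec=org.apache.hadoop.io.compress.LzoCodec \
    -input pages -output crawl.seq -jobconf mapred.reduce.tasks=0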

On Sep 19, 2008, at 8:46 AM, Alex Feinberg wrote:

> Hi Chris,
>
> I was also unable to decompress it by simply running a job with "cat"
> as the mapper and then doing dfs -get, either.
>
> I will try using LzopCodec.
>
> Thanks,
> - Alex
>
> On Fri, Sep 19, 2008 at 2:34 AM, Chris Douglas <chrisdo@yahoo-inc.com> wrote:
>> It's probably not corrupted. If by "compressed lzo file" you mean  
>> something
>> readable with lzop, you should use LzopCodec, not LzoCodec.  
>> LzoCodec doesn't
>> write header information required by that tool.
>>
>> Guessing at the output format (length-encoded blocks of data
>> compressed by the lzo algorithm), it's probably readable by
>> TextInputFormat, but YMMV. If you want to use the C tool, you'll have
>> to add the appropriate header (see the lzop source or LzopCodec) using
>> a hex editor and append four zero bytes to the end of the file. You
>> can also use lzo compression in SequenceFiles. -C
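(To make the header graft concrete, here is a completely untested sketch;
header.bin is hypothetical, i.e. bytes copied from a file lzop itself
wrote, per the lzop source for the exact layout:

  cat header.bin part-00000.lzo > fixed.lzo
  # four zero bytes mark the end of an lzop stream
  printf '\0\0\0\0' >> fixed.lzo

Even then the block framing may not line up exactly, so no promises.)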
>>
>> On Sep 18, 2008, at 9:15 PM, Alex Feinberg wrote:
>>
>>> Hello,
>>>
>>> I am running a custom crawler (written internally) using hadoop
>>> streaming. I am attempting to compress the output using LZO, but
>>> instead I am receiving corrupted output that is neither in the format
>>> I am aiming for nor a valid compressed lzo file. Is this a known
>>> issue? Is there anything I am doing inherently wrong?
>>>
>>> Here is the command line I am using:
>>>
>>> ~/hadoop/bin/hadoop jar
>>> /home/hadoop/hadoop/contrib/streaming/hadoop-0.17.2.1-streaming.jar
>>> -inputformat org.apache.hadoop.mapred.SequenceFileAsTextInputFormat
>>> -mapper /home/hadoop/crawl_map -reducer NONE -jobconf
>>> mapred.output.compress=true -jobconf
>>> mapred.output.compression.codec=org.apache.hadoop.io.compress.LzoCodec
>>> -input pages -output crawl.lzo -jobconf mapred.reduce.tasks=0
>>>
>>> The input is in the form of URLs stored as a SequenceFile.
>>>
>>> When running this without LZO compression, no such issue occurs.
>>>
>>> Is there any way for me to recover the corrupted data so that I can
>>> process it with other hadoop jobs or offline?
>>>
>>> Thanks,
>>>
>>> --
>>> Alex Feinberg
>>> Platform Engineer, SocialMedia Networks
>>
>>
>
>
>
> -- 
> Alex Feinberg
> Platform Engineer, SocialMedia Networks

