hadoop-common-user mailing list archives

From Shi Yu <sh...@uchicago.edu>
Subject Re: Split control in Lzo index
Date Thu, 23 Jun 2011 21:52:44 GMT
Thanks Dmitriy!

Not sure how much work it will be. I guess I should customize the 
InputFormat class in this case, right?
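
A minimal sketch of the boundary logic such a customized InputFormat would need, assuming the records are sorted by key and laid out as "key value". It does not use the Hadoop or hadoop-lzo API: KeyAlignedSplits, keyOf, and align are hypothetical names, and the candidate boundaries are plain record indices standing in for the offsets an LZO index would supply.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Hadoop or hadoop-lzo.
final class KeyAlignedSplits {

    // Assumed record layout: the key is everything before the first space.
    static String keyOf(String record) {
        int sp = record.indexOf(' ');
        return sp < 0 ? record : record.substring(0, sp);
    }

    // Moves each candidate boundary (a record index, in ascending order)
    // forward until the record at the boundary starts a new key, so that
    // no run of identical keys is cut in two. Candidates that collapse
    // onto an earlier boundary or reach the end of the data are dropped.
    static List<Integer> align(List<String> records, List<Integer> candidates) {
        List<Integer> aligned = new ArrayList<>();
        int last = 0;
        for (int c : candidates) {
            int b = Math.max(c, last);
            while (b > 0 && b < records.size()
                    && keyOf(records.get(b)).equals(keyOf(records.get(b - 1)))) {
                b++;
            }
            if (b > last && b < records.size()) {
                aligned.add(b);
                last = b;
            }
        }
        return aligned;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList(
                "key1 value1", "key1 value2",
                "key2 value3", "key2 value4",
                "key3 value5");
        // Candidates 1 and 3 both fall inside a run of equal keys.
        System.out.println(align(records, Arrays.asList(1, 3)));
    }
}
```

In a real job the index would hold offsets of compressed blocks rather than record positions, so the reader would still have to decompress around each offset to find where the key changes.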

Shi
On 6/23/2011 4:35 PM, Dmitriy Ryaboy wrote:
> Shi,
> bzip2 compresses much better than LZO. It is also significantly more
> expensive (we are talking orders of magnitude), both on compression
> and decompression.
>
> As for your question regarding custom splits -- LzoIndex does not support
> this kind of logic, as it's written to be generic and doesn't know how to
> read individual records, but you can certainly customize it to fit your use
> case.
>
> D
>
>
>
> On Thu, Jun 23, 2011 at 1:59 PM, Shi Yu <shiyu@uchicago.edu> wrote:
>
>> Hi,
>>
>> My specific question is: is it possible to control the split of Lzo files
>> by customizing the Lzo index files?
>>
>> The background of the problem is:
>>
>> I have a file which has the following format
>>
>> key1 value1
>> key1 value2
>> key2 value3
>> key2 value4
>> ...
>>
>> Its size in plain text before compression is 11 M. After Lzo compression,
>> the size is 681 K. I tried this with two formats: Text format and Sequence
>> format with block compression. The compressed sizes are almost the same.
>>
>> However, when I join the same keys together and reformat the file as
>>
>> key1 value1 value2
>> key2 value3 value4
>> ...
>>
>> The size before compression is of course more or less the same, 11 M. But
>> after Lzo compression, the size is 4.8 M. My guess is: maybe the Lzo
>> compression algorithm can compress a lot of similar values in the first
>> format, whereas in the second format the concatenations of multiple values
>> are less likely to be identical, so the compression ratio drops.
>>
>> So, again, my question is: if I would like to keep the file in the first
>> format, I need to prevent the file from being split within the same key. For
>> example, all "key1" records should go to the same mapper. Is it doable on a
>> Lzo file? Because the split behavior of Lzo files relies on the index files,
>> is there any way to control the split by customizing the Lzo index files?
>>
>> BTW, when using the second format, I found that bzip2 has a better
>> compression ratio than Lzo (2.1 M). Did I make a mistake when using Lzo
>> compression?
>>
>> Thanks!
>>
>> Best Regards,
>>
>> Shi
>>
>>
>>

