hadoop-common-user mailing list archives

From Shi Yu <sh...@uchicago.edu>
Subject Split control in Lzo index
Date Thu, 23 Jun 2011 20:59:42 GMT
Hi,

My specific question is: is it possible to control how Lzo files are 
split by customizing the Lzo index files?

The background of the problem is:

I have a file in the following format:

key1 value1
key1 value2
key2 value3
key2 value4
...

Its size as plain text before compression is 11 M. After Lzo 
compression, the size is 681 K. I tried this with two storage formats, 
Text and SequenceFile with block compression, and the compressed sizes 
are almost the same.

However, when I group the values of each key onto one line and reformat the file as:

key1 value1 value2
key2 value3 value4
...

The size before compression is of course more or less the same, 11 M. But 
after Lzo compression, the size is 4.8 M. My guess is that the Lzo 
compression algorithm can compress the many similar lines in the first 
format, whereas in the second format the concatenations of multiple 
values are less likely to be identical, so the compression ratio 
drops.
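
One way to check this guess on a small sample is to build both layouts from the same key/value pairs and compare the compressed sizes. The sketch below is only an illustration: it uses zlib from the Python standard library as a stand-in, not Lzo, so the absolute numbers will differ from what Lzo produces, and the helper names and sample pairs are hypothetical.

```python
import zlib
from collections import OrderedDict

def one_pair_per_line(pairs):
    # First layout: "key value" on every line.
    return "".join(f"{k} {v}\n" for k, v in pairs).encode("utf-8")

def grouped_by_key(pairs):
    # Second layout: "key value value ..." with one line per key.
    groups = OrderedDict()
    for k, v in pairs:
        groups.setdefault(k, []).append(v)
    return "".join(f"{k} {' '.join(vs)}\n" for k, vs in groups.items()).encode("utf-8")

def compressed_size(data):
    # zlib is a stand-in here; swap in an Lzo binding to test Lzo itself.
    return len(zlib.compress(data))

# Hypothetical sample; replace with pairs read from the real file.
pairs = [("key1", "value1"), ("key1", "value2"),
         ("key2", "value3"), ("key2", "value4")]

flat = one_pair_per_line(pairs)
grouped = grouped_by_key(pairs)
sizes = {
    "flat": (len(flat), compressed_size(flat)),
    "grouped": (len(grouped), compressed_size(grouped)),
}
print(sizes)
```

Running the same comparison on a slice of the real data with the actual codec would show whether the layout alone explains the difference.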

So, again, my question is: if I want to keep the file in the first 
format, I need to prevent the input from being split in the middle of a 
key, so that, for example, all "key1" records go to the same mapper. Is 
this doable with an Lzo file? Since the split behavior of Lzo files 
relies on the index files, is there any way to control the splits by 
customizing the Lzo index files?
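
To make the idea concrete, here is a minimal sketch of what I mean by customizing the index: keep only those block offsets at which the key changes, so that any split boundary taken from the index falls between two keys. It rests on assumptions I have not verified: that the index is a flat sequence of 8-byte big-endian block offsets, and that I can find out, for each compressed block, the key of the last record before it and the key of the first record in it. The function names and the sample offsets are hypothetical.

```python
import struct
from typing import List, Tuple

# Hypothetical input, one entry per compressed block in file order:
# (block_offset, key_of_last_record_before_block, key_of_first_record_in_block)
BlockInfo = Tuple[int, str, str]

def key_aligned_offsets(blocks: List[BlockInfo]) -> List[int]:
    """Keep only block offsets where the key changes at the block start."""
    kept = []
    for i, (offset, key_before, key_after) in enumerate(blocks):
        # The first block has no predecessor, so it is always kept.
        if i == 0 or key_before != key_after:
            kept.append(offset)
    return kept

def encode_index(offsets: List[int]) -> bytes:
    # Assumption: the index is a flat list of 8-byte big-endian offsets.
    return b"".join(struct.pack(">q", off) for off in offsets)

# Made-up offsets and keys, for illustration only.
blocks = [
    (38, "", "key1"),
    (5000, "key1", "key1"),   # key1 continues across this block: drop
    (9800, "key1", "key2"),   # key changes here: safe split point
    (15200, "key2", "key2"),  # key2 continues across this block: drop
]

offsets = key_aligned_offsets(blocks)
index_bytes = encode_index(offsets)
print(offsets)
```

With fewer offsets in the index there are fewer possible split points, so a key with many records would produce one large split.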

BTW, when using the second format, I found that bzip2 gives a better 
compression ratio than Lzo (2.1 M vs. 4.8 M). Did I make a mistake when 
using Lzo compression?

Thanks!

Best Regards,

Shi


