From Ashish Thusoo
Subject RE: Compression
Date Tue, 02 Dec 2008 11:21:42 GMT
Can't we set up proper codecs for sequence files?

From: Josh Ferguson
Sent: Tuesday, December 02, 2008 1:37 AM
Subject: Re: Compression

I'm not sure, from their wiki:

Compressed Input

Compressed files are difficult to process in parallel, since they cannot, in general, be split
into fragments and independently decompressed. However, if the compression is block-oriented
(e.g. bz2), the splitting and parallel processing is easy to do.

Pig has inbuilt support for processing .bz2 files in parallel (.gz support is coming soon).
If the input file name extension is .bz2, Pig decompresses the file on the fly and passes
the decompressed input stream to your load function. For example,

A = LOAD 'input.bz2' USING myLoad();

Multiple instances of myLoad() (as dictated by the degree of parallelism) will be created
and each will be given a fragment of the *decompressed* version of input.bz2 to process.
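
A minimal sketch of why block-oriented compression allows parallel processing. This is not how Pig or Hadoop implement it: they locate block boundaries inside a single .bz2 stream, whereas this sketch assumes each chunk is compressed as its own independent bz2 stream. The function names and the chunk size are hypothetical, chosen only for illustration.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor


def compress_in_chunks(data, chunk_size):
    """Compress each fixed-size chunk as an independent bz2 stream."""
    return [
        bz2.compress(data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    ]


def decompress_parallel(parts, workers=4):
    """Decompress the independent streams concurrently, preserving order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(bz2.decompress, parts))


# Build some sample input, split it into independently compressed pieces,
# and recover it both in parallel and as one concatenated stream.
data = b"".join(b"record %d\n" % i for i in range(10000))
parts = compress_in_chunks(data, 16 * 1024)
restored = decompress_parallel(parts)
whole = bz2.decompress(b"".join(parts))
```

Because no piece depends on the bytes before it, each worker can start anywhere a piece begins. A single gzip stream offers no such restart points, which is the difficulty the wiki text describes.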

On Dec 2, 2008, at 1:32 AM, Zheng Shao wrote:

Can you give a little more detail?
For example, did you try a single .bz2 file as input, and did the Pig job have 2 or more mappers?

I didn't know bz2 was splittable.

On Tue, Dec 2, 2008 at 1:18 AM, Josh Ferguson wrote:
It is splittable because of how the compression uses blocks; Pig does this out of the box.


On Dec 2, 2008, at 1:14 AM, Zheng Shao wrote:

It shouldn't be a problem for Hive to support it (by defining your own input/output file format
that does the decompression on the fly), but we won't be able to parallelize the execution
as we do with uncompressed text files and sequence files, since bz2 compression is not splittable.


