pig-user mailing list archives

From Benjamin Reed <br...@yahoo-inc.com>
Subject Re: Compressed data file questions
Date Thu, 20 Dec 2007 16:47:03 GMT
Right now only bz2 files are supported. I have an open issue and the start of 
a patch for gzip support, but it's not there yet.
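For reference, a minimal sketch of what running Pig over a bz2 input looks like (the path and tab delimiter are illustrative; Pig recognizes the .bz2 extension on the file name and decompresses transparently):

```pig
-- Illustrative sketch: load a bzip2-compressed, tab-delimited file.
-- 'logs/access_log.bz2' is a hypothetical path.
raw = LOAD 'logs/access_log.bz2' USING PigStorage('\t');
first10 = LIMIT raw 10;
DUMP first10;
```

Nothing about the script changes for compressed input; the decompression happens inside the loader.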

When Pig runs against a bz2 file and (in the future) a properly formatted 
gzip file, it chunks up the file and then finds compression resync points 
so that the compressed chunks can be segmented and processed in parallel.

We have not tested with the record and block compression built into Hadoop. It 
should work; we just haven't tried it.
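Regarding the Hadoop-level record/block compression mentioned above, a hedged sketch of the relevant job configuration (property names as of Hadoop 0.15-era releases; exact names may differ in your version) would be:

```xml
<!-- Sketch, not tested with Pig: enable block-compressed
     SequenceFile output for a Hadoop job. These properties
     affect job *output*, not how input files are split. -->
<property>
  <name>mapred.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.output.compression.type</name>
  <value>BLOCK</value> <!-- RECORD is the default -->
</property>
```

As the property names suggest, these control the compression of map/reduce output, which is why they don't answer the input-splitting question below.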


On Thursday 20 December 2007 06:17:14 Craig Macdonald wrote:
> I have some compression-related questions.
> Utkarsh Srivastava wrote:
>  >Also, you do not need to run on uncompressed data. You can run Pig
> directly on bz2 files.
> Actually, I just compressed it for transfer over the net, but this
> reminds me of questions I have.
> I have large compressed datasets on which I want to run Pig.
> E.g. one file is 9 GB gzip-compressed, 100 GB uncompressed.
> If I put this into Pig, does it compress each block separately (say
> option [a]), or does it just chunk the compressed file (which isn't
> handy) (option [b]). If Pig can handle the compressed files [and the
> latter is true], I assume that it has to stream the entire file to one
> map task - which defeats the purpose really.
> Hadoop has options called
> mapred.output.compression.type and mapred.map.output.compression.type
> Do these affect the question above?
> They are by default set to RECORD, I presume BLOCK enables option [a] -
> to some respect, but they both seem to affect the output of mapred, not
> the input?
> For the files above, it's not really handy to decompress each file, then
> load it into dfs.
> Thanks
> Craig
