pig-user mailing list archives

From Craig Macdonald <cra...@dcs.gla.ac.uk>
Subject Compressed data file questions
Date Thu, 20 Dec 2007 14:17:14 GMT
I have some compression-related questions.

Utkarsh Srivastava wrote:
> Also, you do not need to run on uncompressed data. You can run Pig
> directly on bz2 files.

Actually, I just compressed it for transfer over the net, but this 
reminds me of questions I have.
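
For concreteness, I take "run directly on bz2 files" to mean something
like the following sketch (the path, delimiter and schema are just
placeholders from my setup):

    -- placeholder path/schema; PigStorage is Pig's default loader
    docs = LOAD 'mydata.bz2' USING PigStorage('\t') AS (url, text);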

I have large compressed datasets on which I want to run Pig.
E.g. one file is 9 GB gzip-compressed, 100 GB uncompressed.

If I put this into Pig, does it compress each block separately (say,
option [a]), or does it just chunk the compressed file, which isn't
handy (option [b])? If Pig can handle the compressed files [and the
latter is true], I assume that it has to stream the entire file to one
map task - which rather defeats the purpose.

Hadoop has options called
mapred.output.compression.type and mapred.map.output.compression.type
Do these affect the question above?
They default to RECORD; I presume BLOCK enables option [a], to some
extent, but both seem to affect the output of mapred, not
the input?
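
For reference, my (possibly mistaken) understanding is that these are
set in hadoop-site.xml along these lines - BLOCK being the value I
would try for option [a]:

    <property>
      <name>mapred.output.compression.type</name>
      <value>BLOCK</value>
    </property>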

For the files above, it's not really handy to decompress each file and
then load it into DFS.
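
That is, the workaround I would like to avoid looks roughly like this
(paths made up):

    gunzip mydata.gz                    # leaves ~100 GB on local disk
    bin/hadoop dfs -copyFromLocal mydata /user/craig/mydata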

Thanks

Craig
