pig-user mailing list archives

From Utkarsh Srivastava <utka...@yahoo-inc.com>
Subject Re: Compressed data file questions
Date Thu, 20 Dec 2007 19:56:06 GMT
I added some documentation on this:

http://wiki.apache.org/pig/PigLatin#head-8e419219563705f0cbe965015fa85c2b6e59b168
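
In short, a minimal sketch (the file name here is made up; Pig keys
off the .bz2 extension and decompresses transparently on load):

-- 'pages.txt.bz2' is a hypothetical bzip2-compressed input file
raw = LOAD 'pages.txt.bz2' USING PigStorage('\t') AS (url, body);
DUMP raw;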

Utkarsh

On Dec 20, 2007, at 6:17 AM, Craig Macdonald wrote:

> I have some compression-related questions.
>
> Utkarsh Srivastava wrote:
> > Also, you do not need to run on uncompressed data. You can run
> > Pig directly on bz2 files.
>
> Actually, I just compressed it for transfer over the net, but this  
> reminds me of questions I have.
>
> I have large compressed datasets on which I want to run Pig.
> E.g. one file is 9 GB gzip-compressed, 100 GB uncompressed.
>
> If I put this into Pig, does it compress each block separately
> (call it option [a]), or does it just chunk the compressed file,
> which isn't handy (option [b])? If Pig can handle compressed files
> [and the latter is true], I assume it has to stream the entire
> file to one map task, which rather defeats the purpose.
>
> Hadoop has options called
> mapred.output.compression.type and mapred.map.output.compression.type
> Do these affect the question above?
> They default to RECORD; I presume BLOCK enables option [a] to some
> extent, but they both seem to affect the output of mapred, not the
> input?
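>
> If they did matter, something like this grunt sketch is what I
> would try (assuming set forwards arbitrary properties to the
> Hadoop job conf, which I haven't verified):
>
> -- hypothetical: assumes set passes these through to Hadoop
> set mapred.output.compression.type 'BLOCK';
> set mapred.map.output.compression.type 'BLOCK';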
>
> For files this size, it's not really practical to decompress each
> file and then load it into DFS.
>
> Thanks
>
> Craig

