hadoop-hdfs-user mailing list archives

From Sean Bigdatafun <sean.bigdata...@gmail.com>
Subject Re: Is a Block compressed (GZIP) SequenceFile splittable in MR operation?
Date Mon, 31 Jan 2011 17:11:54 GMT
On Mon, Jan 31, 2011 at 12:36 AM, Niels Basjes <Niels@basjes.nl> wrote:

> Hi,
> 2011/1/31 Sean Bigdatafun <sean.bigdatafun@gmail.com>:
> > GZIP is not splittable.
> Correct, gzip is a stream compression system, which effectively means
> you can only start decompressing at the beginning of the data.
> > Does that mean a GZIP block compressed sequencefile can't take advantage
> > of MR parallelism?
> AFAIK it should be splittable in the same blocks as the compression was
> done.
Splittable within the same block?

Normally, if a 1GB file is not GZIP compressed, each mapper picks one HDFS
block (64MB in an HDFS with the default configuration) of the file for map
processing. That is the scenario for an uncompressed file.

But since GZIP is not splittable, can a mapper still pick a block, and if
so, how? (If it can't, then we can't use the MapReduce framework for
parallelism.)

Can you explain in more detail?
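
To make the question concrete, here is a minimal Python sketch of the idea
Niels describes (splitting at the boundaries where compression was done).
It is not Hadoop code. It assumes a made-up container layout in which each
batch of records is compressed on its own and prefixed with a sync marker
and a length, which is roughly how a block compressed SequenceFile lets a
reader resynchronize. The names SYNC, write_blocks, read_split and
can_start_at are hypothetical, and zlib stands in for the gzip codec.

```python
import hashlib
import struct
import zlib

# Hypothetical 16-byte sync marker written in front of every compressed block.
SYNC = hashlib.md5(b"demo-sync-marker").digest()


def write_blocks(records, records_per_block):
    """Compress each batch of records independently, prefixed by SYNC + length."""
    out = bytearray()
    for i in range(0, len(records), records_per_block):
        payload = zlib.compress(b"\n".join(records[i:i + records_per_block]))
        out += SYNC + struct.pack(">I", len(payload)) + payload
    return bytes(out)


def read_split(data, start, end):
    """Return the records of every block whose sync marker begins in [start, end)."""
    records = []
    pos = data.find(SYNC, start)  # skip ahead to the first block boundary
    while pos != -1 and pos < end:
        (length,) = struct.unpack_from(">I", data, pos + len(SYNC))
        body = pos + len(SYNC) + 4
        records += zlib.decompress(data[body:body + length]).split(b"\n")
        pos = data.find(SYNC, body + length)
    return records


def can_start_at(stream, offset):
    """Try to decompress one plain zlib stream starting at an arbitrary offset."""
    try:
        zlib.decompress(stream[offset:])
        return True
    except zlib.error:
        return False


if __name__ == "__main__":
    recs = [b"record-%05d" % i for i in range(1000)]
    blob = write_blocks(recs, 100)
    half = len(blob) // 2
    # Two "mappers", each given a byte range it did not choose.
    print(len(read_split(blob, 0, half)), len(read_split(blob, half, len(blob))))
    # One whole-file stream cannot be entered in the middle.
    whole = zlib.compress(b"\n".join(recs))
    print(can_start_at(whole, len(whole) // 2))
```

In this sketch a reader handed an arbitrary byte range scans forward to the
next marker and decompresses whole blocks from there, so two readers can
cover the file between them without either starting at byte 0. A file that
is one single compressed stream offers no such entry points.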

> > How to control the size of block to be compressed in SequenceFile?
> Can't help you with that one.
> --
> Met vriendelijke groeten,
> Niels Basjes

