hadoop-hdfs-user mailing list archives

From Niels Basjes <Ni...@basjes.nl>
Subject Re: Bzip2 files as an input to MR job
Date Mon, 22 Sep 2014 15:13:18 GMT

You can use GZip-style compression (Avro's deflate codec) inside the Avro
files and still have splittable Avro files. This has to do with the fact that
there is a block structure inside the Avro file and each of these blocks is
compressed separately.

I suggest you simply try it.
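
To illustrate why compressing per block keeps a file splittable, here is a toy
sketch that uses only java.util.zip. It is not Avro's actual container format
(real Avro files also have a header, sync markers between blocks and
binary-encoded records), and the class and method names are made up for this
example. The point is only that each block is deflated on its own, so a reader
can decode any block without having seen the ones before it.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical toy class; NOT the Avro file format.
class BlockCompressionSketch {

    // Compress one block on its own, so it carries no state from earlier blocks.
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompress a single block; needs nothing but that block's bytes.
    static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && inflater.needsInput()) {
                    throw new IllegalStateException("truncated block");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt block", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    // Group records into blocks and compress each block separately.
    static List<byte[]> writeBlocks(List<String> records, int recordsPerBlock) {
        List<byte[]> blocks = new ArrayList<>();
        for (int i = 0; i < records.size(); i += recordsPerBlock) {
            int end = Math.min(i + recordsPerBlock, records.size());
            String joined = String.join("\n", records.subList(i, end));
            blocks.add(deflate(joined.getBytes(StandardCharsets.UTF_8)));
        }
        return blocks;
    }

    // Read the records of one block in isolation.
    static List<String> readBlock(byte[] block) {
        String text = new String(inflate(block), StandardCharsets.UTF_8);
        return Arrays.asList(text.split("\n"));
    }

    public static void main(String[] args) {
        List<String> records =
            Arrays.asList("r0", "r1", "r2", "r3", "r4", "r5", "r6");
        List<byte[]> blocks = writeBlocks(records, 3);
        // A reader handed only the second block can decode it directly.
        System.out.println(readBlock(blocks.get(1))); // prints [r3, r4, r5]
    }
}
```

A file that is gzipped as one single stream does not have this property: a
reader has to start decompressing from the beginning, which is why such a file
cannot be split across mappers.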


On Mon, Sep 22, 2014 at 4:40 PM, Georgi Ivanov <ivanov@vesseltracker.com> wrote:

> Hi guys,
> I would like to compress the files on HDFS to save some storage.
> As far as I can see, bzip2 is the only format which is splittable (and slow).
> The actual files are Avro.
> So in my driver class I have:
> job.setInputFormatClass(AvroKeyInputFormat.class);
> I have a number of jobs running that process Avro files, so I would like to
> keep the code changes to a minimum.
> Is it possible to compress these Avro files with bzip2 and keep the code
> of the MR jobs the same (or with little change)?
> If it is, please give me some hints, as so far I haven't been able to find any
> good resources on the Internet.
> Georgi

Best regards / Met vriendelijke groeten,

Niels Basjes
