hadoop-user mailing list archives

From Georgi Ivanov <iva...@vesseltracker.com>
Subject Re: Bzip2 files as an input to MR job
Date Mon, 22 Sep 2014 15:21:29 GMT
Hi Niels,
Thanks for the reply.
Changing the Avro files is not really an option for me, as it would
require a lot of time (I have a lot of them).
The Avro files themselves are already compressed a bit,
but bzip2 still gives 50% compression on a single Avro file.

So what I want is to use bzip2-compressed files as input to my MR jobs.
Bzip2 is splittable.
It should be possible somehow, but I can't find out how at the moment.

On 22.09.2014 17:13, Niels Basjes wrote:
> Hi,
> You can use the GZip inside the AVRO files and still have splittable 
> AVRO files.
> This has to do with the fact that there is a block structure inside 
> the AVRO files and these blocks are gzipped.
> I suggest you simply try it.
> Niels
> On Mon, Sep 22, 2014 at 4:40 PM, Georgi Ivanov 
> <ivanov@vesseltracker.com> wrote:
>     Hi guys,
>     I would like to compress the files on HDFS to save some storage.
>     As far as I can see, bzip2 is the only format which is splittable (and
>     slow).
>     The actual files are Avro.
>     So in my driver class I have:
>     job.setInputFormatClass(AvroKeyInputFormat.class);
>     I have a number of jobs processing Avro files, so I would
>     like to keep the code changes to a minimum.
>     Is it possible to compress these Avro files with bzip2 and keep
>     the code of the MR jobs the same (or with little change)?
>     If it is, please give me some hints, as so far I can't seem to
>     find any good resources on the Internet.
>     Georgi
> -- 
> Best regards / Met vriendelijke groeten,
> Niels Basjes
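
Niels's point about the block structure can be illustrated with a small toy sketch. It is not the Avro file format and does not use the Avro library: it only assumes plain java.util.zip, and the class and method names (BlockCompressionSketch, writeBlocks, and so on) are made up for illustration. The idea it shows is that when each block of records is compressed on its own, a reader can decode any one block without reading the blocks before it, which is what keeps such a file splittable. A real Avro container additionally writes a header and sync markers between blocks so that a reader can find block boundaries.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Toy illustration only: hypothetical names, not the Avro container format.
public class BlockCompressionSketch {

    // Compress one block of records on its own; no state is shared between blocks.
    static byte[] compressBlock(List<String> records) {
        byte[] raw = String.join("\n", records).getBytes(StandardCharsets.UTF_8);
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompress a single block without needing any other block.
    static List<String> decompressBlock(byte[] block) {
        Inflater inflater = new Inflater();
        inflater.setInput(block);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && inflater.needsInput()) {
                    break; // truncated input: stop instead of looping forever
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt block", e);
        } finally {
            inflater.end();
        }
        String text = new String(out.toByteArray(), StandardCharsets.UTF_8);
        return Arrays.asList(text.split("\n"));
    }

    // Group records into fixed-size blocks and compress each block separately.
    static List<byte[]> writeBlocks(List<String> records, int recordsPerBlock) {
        List<byte[]> blocks = new ArrayList<>();
        for (int i = 0; i < records.size(); i += recordsPerBlock) {
            int end = Math.min(i + recordsPerBlock, records.size());
            blocks.add(compressBlock(records.subList(i, end)));
        }
        return blocks;
    }

    public static void main(String[] args) {
        List<String> records = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            records.add("record-" + i);
        }
        List<byte[]> blocks = writeBlocks(records, 4);
        // A reader that is handed only the second block can still decode it.
        System.out.println(decompressBlock(blocks.get(1)));
    }
}
```

A file that is compressed as one continuous stream from the outside has no such independent units unless the codec itself provides them, which is why the choice between compressing inside the container and compressing the whole file matters for MR input splits.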
