From: Georgi Ivanov
Date: Mon, 22 Sep 2014 17:21:29 +0200
To: user@hadoop.apache.org
Subject: Re: Bzip2 files as an input to MR job

Hi Niels,
Thanks for the reply.
Changing the Avro files is not really an option for me, as it would take a lot of time (I have a lot of them).
The Avro files themselves are already compressed a bit.
Even so, bzip2 still gives 50% compression on a single Avro file.

So what I want is to use bzip2-compressed files as input to my MR jobs.
Bzip2 is splittable.
It should be possible somehow, but I can't seem to find out how at the moment.

On 22.09.2014 17:13, Niels Basjes wrote:
Hi,

You can use GZip inside the Avro files and still have splittable Avro files.
This has to do with the fact that there is a block structure inside the Avro file and these blocks are gzipped.
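
For illustration only, here is a toy sketch in plain Java of why per-block compression keeps a file splittable. It is not the Avro file format or the Avro API: the class name BlockSplitSketch, the fixed marker value and the block layout are all made up for the example, and it uses java.util.zip's Deflater as a stand-in codec. The idea it shares with Avro is that each block is compressed independently and delimited by a 16-byte sync marker, so a reader handed an arbitrary byte range can scan forward to the next marker and start decompressing there.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Toy container, NOT the real Avro file format: every block is compressed
// on its own and preceded by a fixed 16-byte sync marker (a made-up value).
class BlockSplitSketch {
    static final byte[] SYNC = {
        (byte) 0x9E, 0x37, 0x79, (byte) 0xB9, 0x7F, 0x4A, 0x7C, 0x15,
        (byte) 0xF3, (byte) 0x9C, (byte) 0xC0, 0x60, 0x5C, (byte) 0xED, (byte) 0xC8, 0x34
    };

    // Layout per block: SYNC | int packedLen | int rawLen | deflated bytes.
    static byte[] write(List<String> blocks) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        for (String block : blocks) {
            byte[] raw = block.getBytes(StandardCharsets.UTF_8);
            Deflater deflater = new Deflater();
            deflater.setInput(raw);
            deflater.finish();
            ByteArrayOutputStream packed = new ByteArrayOutputStream();
            while (!deflater.finished()) {
                packed.write(buf, 0, deflater.deflate(buf));
            }
            deflater.end();
            out.write(SYNC, 0, SYNC.length);
            byte[] lengths = ByteBuffer.allocate(8).putInt(packed.size()).putInt(raw.length).array();
            out.write(lengths, 0, lengths.length);
            out.write(packed.toByteArray(), 0, packed.size());
        }
        return out.toByteArray();
    }

    // Returns the blocks whose sync marker starts inside [start, end).
    static List<String> readSplit(byte[] file, int start, int end) {
        List<String> result = new ArrayList<>();
        int pos = nextSync(file, start);
        while (pos >= 0 && pos < end) {
            ByteBuffer lengths = ByteBuffer.wrap(file, pos + SYNC.length, 8);
            int packedLen = lengths.getInt();
            int rawLen = lengths.getInt();
            int dataStart = pos + SYNC.length + 8;
            byte[] raw = new byte[rawLen];
            Inflater inflater = new Inflater();
            inflater.setInput(file, dataStart, packedLen);
            try {
                inflater.inflate(raw);
            } catch (DataFormatException e) {
                throw new IllegalStateException("corrupt block at offset " + pos, e);
            } finally {
                inflater.end();
            }
            result.add(new String(raw, StandardCharsets.UTF_8));
            pos = nextSync(file, dataStart + packedLen);
        }
        return result;
    }

    // Finds the first sync marker starting at or after 'from', or -1 if none.
    // A chance match inside compressed data is possible but extremely unlikely.
    static int nextSync(byte[] file, int from) {
        for (int i = Math.max(from, 0); i + SYNC.length <= file.length; i++) {
            if (Arrays.equals(Arrays.copyOfRange(file, i, i + SYNC.length), SYNC)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] file = write(Arrays.asList("rec-1 rec-2", "rec-3 rec-4", "rec-5"));
        int mid = file.length / 2; // arbitrary byte offset, like an input split boundary
        System.out.println(readSplit(file, 0, mid));
        System.out.println(readSplit(file, mid, file.length));
    }
}
```

Every block comes back from exactly one of the two calls, wherever mid falls, because a block belongs to the split in which its marker starts.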

I suggest you simply try it.

Niels


On Mon, Sep 22, 2014 at 4:40 PM, Georgi Ivanov <ivanov@vesseltracker.com> wrote:
Hi guys,
I would like to compress the files on HDFS to save some storage.

As far as I can see, bzip2 is the only format which is splittable (and slow).

The actual files are Avro.

So in my driver class I have:

job.setInputFormatClass(AvroKeyInputFormat.class);

I have a number of jobs running that process Avro files, so I would like to keep the code changes to a minimum.

Is it possible to compress these Avro files with bzip2 and keep the code of the MR jobs the same (or with little change)?
If it is, please give me some hints, as so far I haven't been able to find any good resources on the Internet.


Georgi



--
Best regards / Met vriendelijke groeten,

Niels Basjes
