Subject: Re: Bzip2 files as an input to MR job
From: Niels Basjes
To: user@hadoop.apache.org
Date: Mon, 22 Sep 2014 17:13:18 +0200

Hi,

You can use GZip-style (deflate) compression inside the Avro files and still have splittable Avro files.
This has to do with the fact that an Avro file has a block structure inside, and it is these blocks that are compressed, not the file as a whole.
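
To illustrate why per-block compression keeps a file splittable, here is a toy sketch in plain Java. It is not the real Avro container format: the `BlockCompressionSketch` class, its fixed 16-byte marker and the length-prefixed layout are made up for illustration, and it uses the zlib/deflate streams from `java.util.zip` rather than Avro's own codecs. The point is only that a reader dropped at an arbitrary offset can scan to the next marker and decompress whole blocks from there.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Toy container, NOT the real Avro file format: every block is written as
// [16-byte sync marker][4-byte compressed length][deflate-compressed payload].
final class BlockCompressionSketch {

    // Hypothetical fixed marker (assumption for this sketch only).
    static final byte[] SYNC = "#SYNC-MARKER-01#".getBytes(StandardCharsets.US_ASCII);

    // Compress one block on its own, independent of all other blocks.
    static byte[] deflate(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream z = new DeflaterOutputStream(out)) {
            z.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Decompress the payload of one block stored at file[offset, offset + length).
    static byte[] inflate(byte[] file, int offset, int length) {
        try (InflaterInputStream z =
                 new InflaterInputStream(new ByteArrayInputStream(file, offset, length))) {
            return z.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Write all blocks into one byte array, each preceded by marker and length.
    static byte[] write(List<String> blocks) {
        ByteArrayOutputStream file = new ByteArrayOutputStream();
        for (String block : blocks) {
            byte[] payload = deflate(block.getBytes(StandardCharsets.UTF_8));
            byte[] length = ByteBuffer.allocate(4).putInt(payload.length).array();
            file.write(SYNC, 0, SYNC.length);
            file.write(length, 0, length.length);
            file.write(payload, 0, payload.length);
        }
        return file.toByteArray();
    }

    // Emulate one input split: read every block whose marker starts in [start, end).
    static List<String> readSplit(byte[] file, int start, int end) {
        List<String> result = new ArrayList<>();
        int pos = indexOfSync(file, start); // skip the partial block we landed in
        while (pos >= 0 && pos < end) {
            int length = ByteBuffer.wrap(file, pos + SYNC.length, 4).getInt();
            int dataStart = pos + SYNC.length + 4;
            result.add(new String(inflate(file, dataStart, length), StandardCharsets.UTF_8));
            pos = indexOfSync(file, dataStart + length);
        }
        return result;
    }

    // Position of the first marker at or after 'from', or -1 if there is none.
    static int indexOfSync(byte[] file, int from) {
        for (int i = Math.max(from, 0); i + SYNC.length <= file.length; i++) {
            if (Arrays.equals(file, i, i + SYNC.length, SYNC, 0, SYNC.length)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] file = write(List.of("record-1\nrecord-2", "record-3\nrecord-4", "record-5"));
        int mid = file.length / 2; // arbitrary split point, not aligned to a block
        System.out.println("split 1: " + readSplit(file, 0, mid));
        System.out.println("split 2: " + readSplit(file, mid, file.length));
    }
}
```

Wherever the split point falls, each block is handled by exactly one of the two readers. A file that is gzipped as a whole cannot be read this way, because decompression has to start at the first byte.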

I suggest you simply try it.
Niels

On Mon, Sep 22, 2014 at 4:40 PM, Georgi Ivanov <ivanov@vesseltracker.com> wrote:
> Hi guys,
> I would like to compress the files on HDFS to save some storage.
>
> As far as I can see, bzip2 is the only format which is splittable (and it is slow).
>
> The actual files are Avro.
>
> So in my driver class I have:
>
> job.setInputFormatClass(AvroKeyInputFormat.class);
>
> I have a number of jobs processing Avro files, so I would like to keep the code change to a minimum.
>
> Is it possible to compress these Avro files with bzip2 and keep the code of the MR jobs the same (or with little change)?
> If it is, please give me some hints, as so far I have not found any good resources on the Internet.
>
> Georgi



-- Best regards / Met vriendelijke groeten,

Niels Basjes