From: Doug Cutting
To: user@avro.apache.org
Date: Thu, 5 Jul 2012 10:24:00 -0700
Subject: Re: Avro file size is too big

Ruslan,

This is unexpected. Perhaps we can understand it if we have more
information. What Writable class are you using for keys and values in
the SequenceFile? What schema are you using in the Avro data file?
Can you provide small sample files of each and/or code that will
reproduce this?

Thanks,

Doug

On Wed, Jul 4, 2012 at 6:32 AM, Ruslan Al-Fakikh wrote:
> Hello,
>
> My organization is currently evaluating Avro as a format. Our
> concern is file size. I've done some comparisons on a piece of our
> data.
> Say we have sequence files, compressed. The payload (values) are just
> lines. As far as I know, we use the line number as the key and the
> default codec for compression inside the sequence files. The size is
> 1.6G; when I put the data into Avro with the deflate codec at deflate
> level 9, it becomes 2.2G.
> This is interesting, because the values in the sequence files are just
> strings, while Avro has a normal schema with primitive types, and
> those are kept in binary. Shouldn't Avro be smaller?
> I also took another dataset, which is 28G (gzip files, plain
> tab-delimited text; I don't know the deflate level), and put it
> into Avro, and it became 38G.
> Why is Avro so big? Am I missing some size optimization?
>
> Thanks in advance!
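
Since the gzip deflate level of the second dataset is unknown, one small check that could accompany a reproduction is to compress the same sample at several deflate levels and see how much of a size gap the level alone can explain. The following is only a sketch under stated assumptions: it uses Python's standard `zlib` module as a stand-in for deflate, not the Avro or SequenceFile writers, and the helper `deflate_sizes` and the tab-delimited sample are hypothetical, made up for illustration.

```python
import zlib


def deflate_sizes(data: bytes, levels=(1, 6, 9)) -> dict:
    """Return {level: compressed size in bytes} for one deflate stream per level.

    Hypothetical helper: zlib stands in for deflate here; this does not
    call any Avro or Hadoop code.
    """
    return {level: len(zlib.compress(data, level)) for level in levels}


# Hypothetical sample: tab-delimited text lines, loosely mimicking the
# "plain tab-delimited text" described in the thread. Real data should be
# substituted when reproducing the issue.
sample = "".join(
    f"{i}\tuser{i % 50}\t{i * 37 % 1000}\tsome-repeated-field\n"
    for i in range(20000)
).encode("utf-8")

sizes = deflate_sizes(sample)

print(f"raw size: {len(sample)} bytes")
for level, size in sizes.items():
    print(f"deflate level {level}: {size} bytes")
```

Running the same measurement on a small slice of the real data would show whether the deflate level is a plausible contributor, before looking at the schema or the Writable classes.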