Date: Fri, 6 Jul 2012 02:11:08 +0400
Subject: Re: Avro file size is too big
From: Ruslan Al-Fakikh
To: user@avro.apache.org

Hey Doug,

Here is a little more explanation:
http://mail-archives.apache.org/mod_mbox/avro-user/201207.mbox/%3CCACBYqwQWPaj8NaGVTOir4dO%2BOqri-UM-8RQ-5Uu2r2bLCyuBTA%40mail.gmail.com%3E

I'll answer your questions later, after some investigation.

Thank you!

On Thu, Jul 5, 2012 at 9:24 PM, Doug Cutting wrote:
> Ruslan,
>
> This is unexpected. Perhaps we can understand it if we have more information.
>
> What Writable class are you using for keys and values in the SequenceFile?
>
> What schema are you using in the Avro data file?
>
> Can you provide small sample files of each and/or code that will reproduce this?
>
> Thanks,
>
> Doug
>
> On Wed, Jul 4, 2012 at 6:32 AM, Ruslan Al-Fakikh wrote:
>> Hello,
>>
>> My organization is currently evaluating Avro as a format. Our
>> concern is file size. I've done some comparisons on a piece of our
>> data.
>> Say we have compressed sequence files. The payload (the values) is just
>> lines. As far as I know, we use the line number as the key and the
>> default codec for compression inside the sequence files. The size is 1.6G;
>> when I put the data into Avro with the deflate codec at deflate level 9, it
>> becomes 2.2G.
>> This is interesting, because the values in the sequence files are just strings,
>> while Avro has a normal schema with primitive types, and those are stored
>> in binary. Shouldn't Avro be smaller?
>> I also took another dataset, which is 28G (gzip files, plain
>> tab-delimited text; I don't know the deflate level), and put it
>> into Avro, and it became 38G.
>> Why is Avro so big?
>> Am I missing some size optimization?
>>
>> Thanks in advance!
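
For anyone who wants to check how much of a gap like this can come from compression settings rather than from the format itself, here is a rough sketch. It uses only Python's standard zlib module, not Avro or Hadoop, and the sample rows, the function names, and the 4096-byte block size are all made up for illustration. It compares deflating a payload as one stream with deflating it in independent fixed-size blocks, which is roughly what a container format does when it compresses block by block. It is not a claim about what caused the sizes reported above.

```python
import zlib


def deflate_size(data: bytes, level: int = 9) -> int:
    """Size in bytes of `data` compressed as one zlib/deflate stream."""
    return len(zlib.compress(data, level))


def blockwise_deflate_size(data: bytes, block_size: int, level: int = 9) -> int:
    """Total size when `data` is cut into fixed-size blocks and every block
    is compressed independently (no shared history between blocks)."""
    return sum(
        len(zlib.compress(data[i:i + block_size], level))
        for i in range(0, len(data), block_size)
    )


# Hypothetical tab-delimited rows standing in for real data.
rows = [f"{i}\tuser{i % 50}\tevent{i % 7}\t{i * 31 % 1000}\n" for i in range(20000)]
payload = "".join(rows).encode("ascii")

whole = deflate_size(payload)
small_blocks = blockwise_deflate_size(payload, block_size=4096)

print("raw bytes:          ", len(payload))
print("one deflate stream: ", whole)
print("4096-byte blocks:   ", small_blocks)
```

Running the same comparison on a small sample of the real data, with the block size and deflate level varied, gives a quick feel for how sensitive the compressed size is to those two settings before digging into the writer configuration.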