From: Doug Cutting
To: user@avro.apache.org
Subject: Re: Avro file size is too big
Date: Thu, 5 Jul 2012 15:19:20 -0700

You can use the Avro command-line tool to dump the metadata, which
will show the schema and codec:

java -jar avro-tools.jar getmeta

Doug

On Thu, Jul 5, 2012 at 3:11 PM, Ruslan Al-Fakikh wrote:
> Hey Doug,
>
> Here is a little more explanation:
> http://mail-archives.apache.org/mod_mbox/avro-user/201207.mbox/%3CCACBYqwQWPaj8NaGVTOir4dO%2BOqri-UM-8RQ-5Uu2r2bLCyuBTA%40mail.gmail.com%3E
> I'll
> answer your questions later after some investigation
>
> Thank you!
>
>
> On Thu, Jul 5, 2012 at 9:24 PM, Doug Cutting wrote:
>> Ruslan,
>>
>> This is unexpected. Perhaps we can understand it if we have more information.
>>
>> What Writable class are you using for keys and values in the SequenceFile?
>>
>> What schema are you using in the Avro data file?
>>
>> Can you provide small sample files of each and/or code that will reproduce this?
>>
>> Thanks,
>>
>> Doug
>>
>> On Wed, Jul 4, 2012 at 6:32 AM, Ruslan Al-Fakikh wrote:
>>> Hello,
>>>
>>> In my organization we are currently evaluating Avro as a format. Our
>>> concern is file size. I've done some comparisons on a piece of our
>>> data.
>>> Say we have sequence files, compressed. The payload (values) is just
>>> lines. As far as I know we use the line number as the key and we use the
>>> default codec for compression inside sequence files. The size is 1.6G;
>>> when I put it into Avro with the deflate codec at deflate level 9 it
>>> becomes 2.2G.
>>> This is interesting, because the values in the seq files are just strings,
>>> but Avro has a normal schema with primitive types, and those are kept
>>> binary. Shouldn't Avro be smaller?
>>> Also I took another dataset which is 28G (gzip files, plain
>>> tab-delimited text, I don't know what the deflate level is) and put it
>>> into Avro and it became 38G.
>>> Why is Avro so big? Am I missing some size optimization?
>>>
>>> Thanks in advance!
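
For anyone who wants to check the codec without the tools jar: the metadata that getmeta prints is stored in the header of the Avro data file, as a map that normally includes the keys avro.schema and avro.codec. Below is a minimal Python sketch of reading that header by hand, assuming the standard object container file layout (4-byte magic, then the metadata map, then a 16-byte sync marker). The function names are made up for illustration and are not part of any Avro library.

```python
MAGIC = b"Obj\x01"  # 4-byte magic that opens an Avro object container file


def read_long(stream):
    """Decode one Avro long: a zigzag-encoded, variable-length integer."""
    shift = 0
    accum = 0
    while True:
        chunk = stream.read(1)
        if not chunk:
            raise EOFError("unexpected end of header")
        b = chunk[0]
        accum |= (b & 0x7F) << shift
        if not (b & 0x80):  # high bit clear marks the last byte
            break
        shift += 7
    return (accum >> 1) ^ -(accum & 1)  # undo zigzag encoding


def read_bytes(stream):
    """Decode Avro bytes/string: a long length followed by that many bytes."""
    length = read_long(stream)
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("unexpected end of header")
    return data


def read_file_metadata(stream):
    """Return the header metadata map as {str: bytes} (hypothetical helper)."""
    if stream.read(4) != MAGIC:
        raise ValueError("not an Avro data file")
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:  # a zero-count block terminates the map
            break
        if count < 0:  # negative count: a byte size for the block follows
            count = -count
            read_long(stream)  # block size in bytes, not needed here
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    return meta


# Usage (the path is a placeholder):
#   with open("data.avro", "rb") as f:
#       meta = read_file_metadata(f)
#   print(meta.get("avro.codec", b"null").decode())
#   print(meta["avro.schema"].decode())
```

If avro.codec is missing from the map, the file uses the null codec, meaning the data blocks are not compressed at all. That is worth ruling out first when a file that was meant to be written with deflate comes out larger than expected.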