From: Scott Carey
To: "avro-user@hadoop.apache.org"
Date: Fri, 12 Mar 2010 10:44:04 -0800
Subject: Re: file format stable?

On Mar 12, 2010, at 9:35 AM, Tim Sell wrote:

> excellent! thanks for the response :)
>

I have committed a large dataset to using the current format. The current format will not be abandoned.

The current format has its limitations. It is optimized for larger numbers of smaller records ( ~ < 2K), and probably should not be used for records significantly larger than 1MB. Essentially, it is built for the more typical Hadoop processing use case as well as structured data storage.

The main drawbacks are:

* Synchronous Logging -- the file is written in block-sized chunks; if one wants to commit a record to disk as soon as possible, each record has to be its own block -- this is inefficient.
* Large records -- blocks are read in as a whole, and currently need to fit in memory in some implementations (including Java). We could relax this requirement for some compression codecs.
* Large records -- the final block size has to be known before writing; currently this is done by buffering in memory while writing.
* One schema -- each file has one schema for all records within.
  This is a very good simplification for most needs, but one cannot merge or concatenate two files with different schemas, even for the most minor schema difference.

Use cases that push the boundaries above may require a new and different file format, or perhaps some sort of extension to the current format.

-Scott

> On 12 March 2010 17:30, Doug Cutting wrote:
>> Tim Sell wrote:
>>>
>>> But we're wondering if the file format is set in stone now
>>
>> It should not change again. It did not seem that any were yet using the
>> prior format, and it had some bad limitations, so we revised it. If it ever
>> does change again, we would require implementations to be back-compatible,
>> still able to read the old format.
>>
>> Doug
>>
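
To make the "Synchronous Logging" drawback in Scott's list concrete, here is a rough sketch of a block-buffered writer. It is a toy model, not the Avro container format or any Avro API: the class name ToyBlockWriter, the fixed 8-byte count and size fields, the all-zero 16-byte marker and the 64 KB default are all assumptions made for illustration. It only shows why the block size must be known before writing (hence the in-memory buffer) and why one block per record costs more bytes.

```python
import io

# Hypothetical 16-byte block terminator for this toy model only.
SYNC_MARKER = b"\x00" * 16


class ToyBlockWriter:
    """Buffers encoded records in memory and writes them out one block at a time."""

    def __init__(self, out, block_size=64 * 1024):
        self.out = out
        self.block_size = block_size
        self.pending = []        # records accepted but not yet written
        self.pending_bytes = 0

    def append(self, record: bytes) -> None:
        self.pending.append(record)
        self.pending_bytes += len(record)
        if self.pending_bytes >= self.block_size:
            self.flush_block()

    def flush_block(self) -> None:
        if not self.pending:
            return
        body = b"".join(self.pending)
        # The record count and byte size precede the data, so the whole
        # block has to be buffered before anything can be written.
        header = len(self.pending).to_bytes(8, "big") + len(body).to_bytes(8, "big")
        self.out.write(header + body + SYNC_MARKER)
        self.pending.clear()
        self.pending_bytes = 0


def bytes_written(records, flush_every_record: bool) -> int:
    """Total output size when flushing per record versus per full block."""
    out = io.BytesIO()
    writer = ToyBlockWriter(out)
    for record in records:
        writer.append(record)
        if flush_every_record:
            writer.flush_block()   # "commit as soon as possible"
    writer.flush_block()
    return len(out.getvalue())


records = [b"x" * 10] * 100
print(bytes_written(records, flush_every_record=False))  # 1032
print(bytes_written(records, flush_every_record=True))   # 4200
```

In this toy each block carries 32 bytes of framing, so 100 ten-byte records cost 1032 bytes as one block and 4200 bytes as 100 single-record blocks. The real overhead depends on the actual format and codec.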