From: Doug Cutting
Date: Fri, 03 Apr 2009 13:02:08 -0700
To: general@hadoop.apache.org
Subject: Re: [PROPOSAL] new subproject: Avro
Message-ID: <49D66B40.5070509@apache.org>

George Porter wrote:
> While this representation would certainly be as compact as possible,
> wouldn't it prevent evolving the data structure over time? One of the
> nice features of Google Protocol Buffers and Thrift is that you can
> evolve the set of fields over time, and older/newer clients can talk to
> older/newer services. If the proposed Avro is evolvable, then perhaps
> I'm misunderstanding your statement about the lack of IDs in the
> serialized data.

Avro supports schema evolution.
In Avro, the schema used to write the data must be available when the
data is read. (In files, it is typically stored in the file metadata.)
If you have the schema that was used to write the data, and you're
expecting a slightly different schema, then you simply keep those fields
that are in both schemas and skip those that are not. This is equivalent
to Thrift's and Protocol Buffers' support for schema evolution, but does
not require manually assigning numeric field ids.

This feature can also be used to support projection. If you have records
with many large fields, but only need a single field in a particular
computation, then you can specify an expected schema with only that
field, and the runtime will efficiently skip all of the other fields,
returning a record with just the single, expected field.

> I also agree with Bryan, in that it would be unfortunate to have two
> different Apache projects with overlapping goals.

We already have both Thrift and Etch in the incubator, which have
similar goals. Apache does not attempt to mandate that projects have
disjoint goals. There are many ways to slice things, and Apache prefers
to rely on survival of the fittest rather than forcing things together.

> Regardless of
> features, both protocol buffers and thrift have the advantage of being
> debugged in mission-critical production environments.

Yes, but, as I've argued in other messages in this thread, they do not
support the dynamic features we need. Adding those features would add
new code that would share little with existing code in those projects.
So, while the projects are conceptually similar, the implementations are
necessarily different, and, without significant code overlap, separate
projects seem more natural.

Doug
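
The resolution rule described in this message (keep the fields that are in both schemas, skip the rest) can be sketched in a few lines of Python. This is only an illustration under loud assumptions: it does not use the Avro library or its wire format, schemas are reduced to ordered lists of field names, and `write_record` and `read_record` are hypothetical helper names invented for this sketch.

```python
# Illustrative sketch only: not the Avro API or encoding.
# A "schema" here is just an ordered list of field names (an assumption).

def write_record(record, writer_fields):
    """Emit values in writer-schema order, with no field names or ids."""
    return [record[name] for name in writer_fields]


def read_record(data, writer_fields, reader_fields):
    """Walk the data using the writer's schema, keeping only the fields
    that the reader's (expected) schema also names."""
    wanted = set(reader_fields)
    result = {}
    for name, value in zip(writer_fields, data):
        if name in wanted:
            result[name] = value
        # Otherwise the value is skipped; a real decoder would step over
        # its bytes without materializing it.
    return result


writer_fields = ["id", "name", "payload"]
data = write_record({"id": 7, "name": "ann", "payload": "xxx"}, writer_fields)

# Evolution: the reader expects a slightly different schema.
evolved = read_record(data, writer_fields, ["id", "name", "email"])

# Projection: the reader asks for a single field.
projected = read_record(data, writer_fields, ["id"])
```

Because the reader always has the writer's schema, the data itself carries no per-field tags; position in the writer's schema is enough to tell which value belongs to which field. The sketch simply omits a field the reader expects but the writer never wrote, which matches the rule as stated in the message and nothing more.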