From: Doug Cutting
Date: Fri, 03 Apr 2009 10:28:03 -0700
CC: general@hadoop.apache.org
Subject: Re: [PROPOSAL] new subproject: Avro

Bryan Duxbury wrote:
> It sounds like what you want is the option to avoid pre-generated classes.

That's part of it. But, once you have the schema, you might as well
take advantage of it. With the schema in hand, you don't need to tag
data with field numbers or types, since that's all there in the schema.
So, having the schema, you can use a simpler data format.

Also, with the schema, resolving version differences is simplified.
Developers don't need to assign field numbers, but can just use names.
For performance, one can internally use field numbers while reading, to
avoid string comparisons, but developers no longer need to specify
these and can use names, as in most software. Here having the schema
means we can simplify the IDL and its versioning semantics.

> If that's the only thing you need, it seems like we could bolt that on
> to Thrift with almost no work.

Would you write parsers for Thrift's IDL in every language? Or would
you use JSON, as Avro does, to avoid that? Once you're using a
different IDL and a different data format, what's shared with Thrift?
Fundamentally, those two things define a serialization system, no?

> I assume you'd have the schema stored in
> metadata or file header or something, right? (You wouldn't want to store
> the field names in the binary encoding as strings, since that would
> probably very quickly dwarf the size of the actual data in a lot of cases.)

Yes, in data files the schema is typically stored in the metadata.

> If my assumptions are correct, it seems like it'd be a lot smarter to
> leverage existing Thrift infrastructure and encoding work rather than
> duplicating it for this lone feature.

What specific shared infrastructure would be leveraged? For Hadoop's
RPC, I hope to adapt Hadoop's client and server implementations as a
transport, as these have been highly tuned for Hadoop's performance
requirements.

Doug
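
The points above about untagged data and name-based resolution can be
illustrated with a small sketch. This is not Avro's actual encoding or
API; the schema layout (an ordered list of name/type pairs), the
fixed-width byte format, and every function name here are assumptions
made up for illustration. It shows values written in schema order with
no field numbers or type tags, and a reader that matches fields by name
once, then uses integer positions per record.

```python
import struct

# Hypothetical schemas: ordered (field name, type) pairs, invented for this sketch.
WRITER_SCHEMA = [("id", "long"), ("name", "string")]
READER_SCHEMA = [("name", "string"), ("id", "long"), ("email", "string")]
DEFAULTS = {"email": ""}  # used for fields the writer did not know about


def encode(schema, record):
    """Write values in schema order; no field numbers or type tags are stored."""
    out = bytearray()
    for name, typ in schema:
        value = record[name]
        if typ == "long":
            out += struct.pack(">q", value)
        elif typ == "string":
            raw = value.encode("utf-8")
            out += struct.pack(">I", len(raw)) + raw
        else:
            raise ValueError(f"unsupported type: {typ}")
    return bytes(out)


def decode(writer_schema, data):
    """Read values back as a list, in the order the writer's schema dictates."""
    values, pos = [], 0
    for _name, typ in writer_schema:
        if typ == "long":
            (value,) = struct.unpack_from(">q", data, pos)
            pos += 8
        elif typ == "string":
            (length,) = struct.unpack_from(">I", data, pos)
            pos += 4
            value = data[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise ValueError(f"unsupported type: {typ}")
        values.append(value)
    return values


def build_plan(writer_schema, reader_schema):
    """Match names once; the result holds writer positions (None if absent)."""
    position = {name: i for i, (name, _typ) in enumerate(writer_schema)}
    return [position.get(name) for name, _typ in reader_schema]


def project(plan, reader_schema, values, defaults):
    """Build the reader's view using integer positions, not string comparisons."""
    return {
        name: values[i] if i is not None else defaults[name]
        for (name, _typ), i in zip(reader_schema, plan)
    }


payload = encode(WRITER_SCHEMA, {"id": 7, "name": "avro"})
plan = build_plan(WRITER_SCHEMA, READER_SCHEMA)
record = project(plan, READER_SCHEMA, decode(WRITER_SCHEMA, payload), DEFAULTS)
```

In this sketch the payload carries only the values themselves, so the
reader needs the writer's schema to interpret it, which is why a data
file would keep that schema in its metadata.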