Mailing-List: contact general-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: general@hadoop.apache.org
Sender: Doug Cutting <cutting@gmail.com>
Message-ID: <49D64723.7020706@apache.org>
Date: Fri, 03 Apr 2009 10:28:03 -0700
From: Doug Cutting <cutting@apache.org>
User-Agent: Thunderbird 2.0.0.21 (X11/20090318)
MIME-Version: 1.0
CC: general@hadoop.apache.org
Subject: Re: [PROPOSAL] new subproject: Avro
References: <49D53694.1050906@apache.org>
 <4CB9034E-05FB-4200-AF55-FFD78B2EEFCE@apache.org>
 <3c682ecd0904021711x41fe4dd2j291f2077284d5558@mail.gmail.com>
 <8BBAB2C9-FCF9-4261-9E4B-282CD4196FA2@apache.org>
 <49D63415.8060004@apache.org>
 <718F2DEF-B305-49C9-B62F-155D2F4CE12F@rapleaf.com>
In-Reply-To: <718F2DEF-B305-49C9-B62F-155D2F4CE12F@rapleaf.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

Bryan Duxbury wrote:
> It sounds like what you want is the option to avoid pre-generated classes.

That's part of it.  But, once you have the schema, you might as well 
take advantage of it.

With the schema in hand, you don't need to tag data with field numbers 
or types, since that's all there in the schema.  So, having the schema, 
you can use a simpler data format.
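
A minimal sketch of that point, under stated assumptions: the schema
layout, the tag sizes and all function names below are hypothetical and
are neither Thrift's nor Avro's actual wire format.  It only compares a
tagged encoding with one that relies on the schema for field order and
types.

```python
import struct

# Hypothetical schema: an ordered list of (field name, type) pairs.
SCHEMA = [("id", "int"), ("score", "double")]


def _fmt(ftype):
    # Big-endian 4-byte int or 8-byte double (illustrative choice).
    return ">i" if ftype == "int" else ">d"


def encode_tagged(record):
    """Self-describing layout: every value carries a field number and a type code."""
    type_codes = {"int": 1, "double": 2}
    out = b""
    for number, (name, ftype) in enumerate(SCHEMA, start=1):
        out += struct.pack(">HB", number, type_codes[ftype])  # 3-byte tag
        out += struct.pack(_fmt(ftype), record[name])
    return out


def encode_with_schema(record):
    """Schema-driven layout: values only, written in schema order."""
    return b"".join(struct.pack(_fmt(t), record[n]) for n, t in SCHEMA)


def decode_with_schema(data):
    """The reader walks the schema to know what comes next; no tags needed."""
    record, offset = {}, 0
    for name, ftype in SCHEMA:
        (record[name],) = struct.unpack_from(_fmt(ftype), data, offset)
        offset += struct.calcsize(_fmt(ftype))
    return record
```

In this toy layout the tagged form spends 3 extra bytes per field, while
the schema-driven form stores only the values.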

Also, with the schema, resolving version differences is simplified. 
Developers don't need to assign field numbers, but can just use names, 
as in most software.  For performance, an implementation can internally 
use field numbers while reading, to avoid string comparisons, but 
developers no longer need to specify them.  Here having the schema means 
we can simplify the IDL and its versioning semantics.
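
A rough sketch of name-based resolution, assuming the writer's and the
reader's schemas are both available as ordered lists of field names. 
The function names and the use of per-field defaults for fields the
writer lacks are assumptions made for illustration, not a description of
Avro's actual resolution rules.

```python
def build_read_plan(writer_fields, reader_fields):
    """Match fields by name once, up front.

    Returns, for each reader field, the position of that field in the
    writer's record, or None if the writer never wrote it.
    """
    writer_index = {name: i for i, name in enumerate(writer_fields)}
    return [writer_index.get(name) for name in reader_fields]


def resolve(values, plan, defaults):
    """Per-record step: integer indexing only, no string comparisons."""
    return [
        values[src] if src is not None else defaults[dst]
        for dst, src in enumerate(plan)
    ]


# The writer used an older schema; the reader reordered fields and added "age".
writer_fields = ["id", "name", "email"]
reader_fields = ["name", "id", "age"]
plan = build_read_plan(writer_fields, reader_fields)
row = resolve([7, "ann", "ann@example.org"], plan, defaults=["", 0, -1])
```

The names are compared only while building the plan; each record is then
read through the integer positions in `plan`.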

> If that's the only thing you need, it seems like we could bolt that on 
> to Thrift with almost no work.

Would you write parsers for Thrift's IDL in every language?  Or would 
you use JSON, as Avro does, to avoid that?
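
To illustrate why a JSON schema avoids a custom IDL parser, here is a
small sketch that uses only a stock JSON library.  The record layout and
the helper name are illustrative assumptions, not a statement of the
exact schema syntax.

```python
import json

# Illustrative schema text; the exact layout is an assumption.
SCHEMA_JSON = """
{"type": "record", "name": "User",
 "fields": [{"name": "id", "type": "int"},
            {"name": "name", "type": "string"}]}
"""


def field_types(schema_text):
    """Parse the schema with the standard JSON parser; no IDL grammar needed."""
    schema = json.loads(schema_text)
    return [(field["name"], field["type"]) for field in schema["fields"]]
```

Any language with a JSON library can do the equivalent without a
hand-written parser.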

Once you're using a different IDL and a different data format, what's 
shared with Thrift?  Fundamentally, those two things define a 
serialization system, no?

> I assume you'd have the schema stored in 
> metadata or file header or something, right? (You wouldn't want to store 
> the field names in the binary encoding as strings, since that would 
> probably very quickly dwarf the size of the actual data in a lot of cases.)

Yes, in data files the schema is typically stored in the metadata.
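
As a sketch of the idea, assuming a hypothetical toy container (a
length-prefixed JSON schema followed by fixed-width records, which is
not the real data file layout), the schema can be written once in the
header so that the records themselves carry no field names:

```python
import io
import json
import struct

# Illustrative fixed-width encodings for two types.
FORMATS = {"int": ">i", "double": ">d"}


def write_file(stream, schema, rows):
    """Write the schema once as a header, then bare values for every row."""
    header = json.dumps(schema).encode("utf-8")
    stream.write(struct.pack(">I", len(header)))
    stream.write(header)
    for row in rows:
        for field in schema["fields"]:
            stream.write(struct.pack(FORMATS[field["type"]], row[field["name"]]))


def read_file(stream):
    """Recover the schema from the header and use it to decode the rows."""
    (header_len,) = struct.unpack(">I", stream.read(4))
    schema = json.loads(stream.read(header_len).decode("utf-8"))
    fields = schema["fields"]
    record_size = sum(struct.calcsize(FORMATS[f["type"]]) for f in fields)
    rows = []
    while True:
        chunk = stream.read(record_size)
        if len(chunk) < record_size:
            break  # end of file
        row, offset = {}, 0
        for f in fields:
            fmt = FORMATS[f["type"]]
            (row[f["name"]],) = struct.unpack_from(fmt, chunk, offset)
            offset += struct.calcsize(fmt)
        rows.append(row)
    return schema, rows
```

The field names appear once per file rather than once per record, so
they do not dwarf the data.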

> If my assumptions are correct, it seems like it'd be a lot smarter to 
> leverage existing Thrift infrastructure and encoding work rather than 
> duplicating it for this lone feature.

What specific shared infrastructure would be leveraged?  For Hadoop's 
RPC, I hope to adapt Hadoop's client and server implementations as a 
transport, as these have been highly tuned for Hadoop's performance 
requirements.

Doug