From: Chad Walters <Chad.Walters@microsoft.com>
To: general@hadoop.apache.org
Date: Sun, 5 Apr 2009 23:51:27 -0700
Subject: RE: [PROPOSAL] new subproject: Avro

Doug,

First, let me say that I think Avro has a lot of useful features -- features that I would like to see fully supported in Thrift. At a minimum, I would like for us to be able to hash out the details to guarantee that there can really be full interoperability between Avro and Thrift. I am really interested in working cooperatively and collaboratively on this, and I am willing to put in significant time on design and communication to help make full interoperability possible (I am unfortunately not able to contribute code directly at this time).

Second, I think the decision about where Avro should live requires more thought and more discussion. I'd love to hear from more folks outside of Yahoo on this topic: so far all of the +1 votes have come from Yahoo employees. I'd also love to hear from other folks who have significant investments in both Thrift and Hadoop.

Some points to think about:

-- You suggest that there is not a lot in Thrift that Avro can leverage. I think you may be overlooking the fact that Thrift has a user base and a community of developers who are very interested in issues of cross-language data serialization and interoperability. Thrift has committers with expertise in a pretty big set of languages, and leveraging this could get Avro's functionality onto more languages faster than the current path. Also, there is in fact significant overlap between Hadoop users and Thrift users at this point, as well as significant use of Thrift in more than one Hadoop sub-project. At the code level, Thrift contains a transport abstraction and multiple transport and server implementations in many different target languages. If there were closer collaboration, Avro could certainly benefit from leveraging the existing ones, and any additional contributions in this area would benefit both projects.
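To make that transport/protocol layering concrete, here is a rough Python sketch of the standard Thrift client pattern; the UserStore service and its generated userstore module are hypothetical stand-ins, and module details may vary by Thrift release:

    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol

    # Hypothetical generated bindings for a service defined in Thrift IDL.
    from userstore import UserStore

    # Layer a buffered transport over a raw socket, then choose a wire format;
    # the protocol and transport layers vary independently.
    transport = TTransport.TBufferedTransport(TSocket.TSocket("localhost", 9090))
    protocol = TBinaryProtocol.TBinaryProtocol(transport)

    client = UserStore.Client(protocol)
    transport.open()
    # Swapping TBinaryProtocol for, say, a compact or JSON protocol leaves
    # this client code untouched -- that is the abstraction Avro could reuse.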
-- You also suggest that the two are largely disjoint from a technical perspective: "Thrift fundamentally standardizes an API, not a data format. Avro fundamentally is a data format specification, like XML." I agree with the fundamental part, but I think that doesn't bring to light enough of what is in common and what is different for purposes of this discussion. Thrift specifies a type system, an API for data formats and transport mechanisms, and a schema resolution algorithm, and it provides implementations of several distinct data formats and transports. Avro specifies a single data format, but it also brings along several other things as well, including a type system, a specific RPC mechanism, and a schema resolution algorithm.

The most significant issue is that both of them specify a type system. At a very minimum I would like to see Avro and Thrift make agreements on that type system. The fact that there is significant existing investment in the Thrift type system by the Thrift community should weigh somewhere in this discussion. Obviously, the technical needs of Avro will also have weight there, but where there is room for choice, the Thrift choices should be respected. Arbitrary changes here will make it unnecessarily painful, perhaps impossible, for Thrift to directly adopt Avro; instead Thrift will be forced to make an "Avro-like" data specification, hampering interoperability for everyone. There may be pitfalls in the other areas of overlap as well that would prevent real interoperability -- let's elucidate them in further discussions.

-- Avro appears to have 3 primary features that Thrift does not currently support sufficiently:

1. Schema serialization, allowing a compact representation of files containing large numbers of records of identical type.
2. Dynamic interpretation of schemas, which improves ease of use in dynamic languages (like the Python Hadoop Streaming use case).
3. Lazy partial deserialization to support "projection".

Note that features 1 and 3 are independent of whether schemas are dynamically interpreted or compiled into static bindings.

WRT #1: Thrift's DenseProtocol goes some distance towards this, although it doesn't go the whole way. Thrift can easily be extended to further compact the DenseProtocol's wire format for special cases where all fields are required. We have previously had significant discussions on the Thrift list about doing more in this area, but we couldn't get the folks from Hadoop who cared most about this use case to participate with us on capturing a complete set of requirements, so there was no strong driver for it.

WRT #2: I totally understand the case you make for dynamic interpretation in ad hoc data processing. I would love to see Thrift enhanced to do this kind of thing.

WRT #3: Partial deserialization seems like a really useful feature for several use cases, not just for "projection". I think Thrift could and should be extended to support this functionality, and it should be available for both static bindings and dynamic schema interpretation, via field names and field IDs where possible.
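As a concrete illustration of feature #2 (and, via a reader's schema, #3), here is a minimal Python sketch against the Avro reference implementation as I understand it; events.avro is a hypothetical data file:

    from avro.datafile import DataFileReader
    from avro.io import DatumReader

    # The writer's schema is embedded in the container file, so no generated
    # bindings are needed; each record comes back as a plain Python dict.
    reader = DataFileReader(open("events.avro", "rb"), DatumReader())
    for record in reader:
        print(record)
    reader.close()

If I read the schema-resolution rules correctly, handing DatumReader a reader's schema that names only a subset of the fields is how projection (#3) would be expressed in this same API.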
-- You state: "Perhaps Thrift could be augmented to support Avro's JSON schemas and serialization. Then it could interoperate with other Avro-based systems. But then Thrift would have yet another serialization format, that every language would need to implement for it to be useful..."

First, that "Perhaps" hides a lot of complexity, and unless that is hashed out ahead of time I am pretty sure the real answer will be "Thrift cannot be augmented to support Avro directly, but could instead be augmented to support something that looks quite a bit like Avro but differs in mostly unimportant ways." To me that seems like a shame.

Furthermore, you say that last part ("Thrift would have yet another serialization format...") like it is a bad thing... Note that it is an explicit design goal of Thrift to allow for multiple different serialization formats so that lots of different use cases can be supported by the same fundamental framework. This is a clear recognition that there is no one-size-fits-all answer for data serialization (fast RPC vs. compact archival record data vs. human readability, to name a few salient use cases). For a compelling enough use case, there is no reason not to port new protocols across multiple languages (generally done on an as-needed basis by someone who wants that functionality in that language). Another great feature of the protocol abstraction is that it allows data to be seamlessly moved from one serialization format to another as, say, it is read out of archival storage and sent on as RPC.

Also, doesn't Avro essentially contain "another serialization format that every language would need to implement for it to be useful"? Seems like the same basic set of work to me, whether it is in Avro or Thrift.

-- You state: "Avro fundamentally is a data format specification, like XML. Thrift could implement this specification. The Avro project includes reference implementations, but the format is intended to be simple enough and the specification stable enough that others might reasonably develop alternate, independent implementations."

I think this is a bit inaccurate. First there is the issue of type system compatibility that I raised above, and the implausibility of satisfying that "could" without refinement of, and collaboration on, Avro's specification. Furthermore, the stated goal of the subproject is "for Avro to replace both Hadoop's RPC and to be used for most Hadoop data files". This will bring in quite a bit beyond a reference implementation of a data format specification, especially depending on how many languages you intend to build RPC support for (Java, Python, and C++ were all mentioned at some point -- others?). I don't think it is unreasonable that the significant proportion of folks in the Hadoop community who are also using Thrift are puzzled about why there isn't more consideration being given to convergence between Avro and Thrift.

-- You state: "Also, with the schema, resolving version differences is simplified. Developers don't need to assign field numbers, but can just use names. For performance, one can internally use field numbers while reading, to avoid string comparisons, but developers need no longer specify these, but can use names, as in most software. Here having the schema means we can simplify the IDL and its versioning semantics."

So the simplification comes simply from not having the field IDs in the IDL? I am not sure why having sequential ID numbers after each field is considered to be so onerous. I honestly have never heard a single Thrift user complain about this. Anyone doing more than just that is doing something advanced that wouldn't be possible without the field IDs (like renaming a field).
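To make the names-vs-numbers comparison concrete, here is a toy Python sketch (mine, not from either library) of what the two matching strategies come down to when a record written under an old schema meets a new one:

    # Name-based matching (the Avro approach): fields line up by name, and
    # fields the writer didn't know about are filled in from defaults.
    old_record = {"name": "chad", "email": "chad@example.com"}
    new_fields = ["name", "email", "phone"]   # reader's schema adds "phone"
    defaults = {"phone": None}

    resolved = {f: old_record.get(f, defaults.get(f)) for f in new_fields}
    print(resolved)  # {'name': 'chad', 'email': 'chad@example.com', 'phone': None}

    # ID-based matching (the Thrift approach) is the same operation keyed on
    # numeric tags, e.g. {1: "chad", 2: "chad@example.com"} -- which is exactly
    # what makes renaming a field safe.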
I think having to deal with JSON syntax in the Avro IDL is actually more annoying for humans than the application of field IDs, both for the added syntactic punctuation and for the increased verbosity. If the field IDs are really so objectionable, Thrift could allow them to be optional for purely dynamic usages. I also don't see why matching names is considered easier than matching numbers, which is essentially what the versioning semantics come down to in the end. Am I missing something here?

-- You state: "Would you write parsers for Thrift's IDL in every language? Or would you use JSON, as Avro does, to avoid that?"

Here I totally agree with you: a JSON IDL is better for machine parsing than Thrift's current IDL, which is targeted more at human parsing. And given that I agree that some form of dynamic interpretation is a useful feature, I don't see any reason why a JSON version of the IDL couldn't become part of the picture. Furthermore, the Thrift IDL compiler could easily be extended to take this JSON format as both an input (in addition to the current Thrift IDL) and an output. An alternative would be to have the other languages bind to the Thrift IDL parser directly -- most languages bind to C (granted, for some it is easier than for others) -- and get back the parsed data structure to interpret.

-- By making Avro a sub-project of Hadoop, I believe you will succeed in producing an improved version of Hadoop Record IO and a better RPC mechanism than the current Hadoop RPC. However, I don't think that this will result in a better general RPC than Thrift, and it will certainly be much less performant for RPC in a wide range of applications.

Consider an alternative: making Avro more like a sub-project of Thrift, or just implementing it directly in Thrift. In that case, I think the end result will be a powerful and flexible "one-stop shop" for data serialization for RPC and archival purposes, with the ability to bring both static and dynamic capabilities as needed for particular applications. To me this seems like a bigger win for both Hadoop and for Thrift.

Thanks for reading through to this point. I look forward to further discussion.

Chad