On Apr 3, 2009, at 1:02 PM, Doug Cutting wrote:

> George Porter wrote:
>> While this representation would certainly be as compact as
>> possible, wouldn't it prevent evolving the data structure over
>> time? One of the nice features of Google Protocol Buffers and
>> Thrift is that you can evolve the set of fields over time, and
>> older/newer clients can talk to older/newer services. If the
>> proposed Avro is evolvable, then perhaps I'm misunderstanding your
>> statement about the lack of IDs in the serialized data.
>
> Avro supports schema evolution. In Avro, the schema used to write
> the data must be available when the data is read. (In files, it is
> typically stored in the file metadata.)
>
> If you have the schema that was used to write the data, and you're
> expecting a slightly different schema, then you simply keep those
> fields that are in both schemas and skip those not. This is
> equivalent to Thrift and Protocol Buffer's support for schema
> evolution, but does not require manually assigning numeric field ids.
>
> This feature can also be used to support projection. If you have
> records with many large fields, but only need a single field in a
> particular computation, then you can specify an expected schema with
> only that field, and the runtime will efficiently skip all of the
> other fields, returning a record with just the single, expected field.

Thanks for the clarification--I better understand the schema
relationship. The projection feature is a nice feature, especially
since it seems like it would be able to support "sparse files" where
you want to just peek at large structs without invoking a lot of
disk-io (for data serialized on-disk).

>> I also agree with Bryan, in that it would be unfortunate to have
>> two different Apache projects with overlapping goals.
>
> We already have both Thrift and Etch in the incubator, which have
> similar goals. Apache does not attempt to mandate that projects
> have disjoint goals. There are many ways to slice things, and
> Apache prefers to rely on survival of the fittest rather than
> forcing things together.
>
>> Regardless of features, both protocol buffers and thrift have the
>> advantage of being debugged in mission-critical production
>> environments.
>
> Yes, but, as I've argued in other messages in this thread, they do
> not support the dynamic features we need. Adding those features
> would add new code that would share little with existing code in
> those projects. So, while the projects are conceptually similar, the
> implementations are necessarily different, and, without significant
> code overlap, separate projects seem more natural.
>
> Doug

Makes sense.

Thanks,
George
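
For reference, the schema resolution and projection Doug describes can be sketched roughly as follows. This is a toy model and not the Avro API or its binary encoding. It assumes a schema is just an ordered list of field names, and the helpers `write` and `read` are hypothetical names made up for illustration. Reader fields that the writer never wrote are simply left out here.

```python
# Toy model only: a "schema" is an ordered list of field names.
# write() and read() are hypothetical helpers, not Avro library calls.

def write(record, writer_schema):
    """Serialize values in writer-schema order, with no names or numeric IDs."""
    return [record[name] for name in writer_schema]


def read(values, writer_schema, reader_schema):
    """Decode using the writer schema; keep only fields the reader expects."""
    wanted = set(reader_schema)
    result = {}
    for name, value in zip(writer_schema, values):
        if name in wanted:
            result[name] = value  # field is in both schemas: keep it
        # otherwise the field is only in the writer schema: skip it
    return result


writer_schema = ["id", "name", "payload"]
values = write({"id": 7, "name": "example", "payload": "x" * 1000}, writer_schema)

# Evolution: the reader expects a slightly different set of fields.
evolved = read(values, writer_schema, ["id", "name", "email"])

# Projection: the reader asks for a single field and the rest are skipped.
projected = read(values, writer_schema, ["name"])
```

Because the serialized values carry no field identifiers, `read` cannot interpret them without `writer_schema`, which is why the writer's schema has to be available at read time.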