avro-user mailing list archives

From Scott Carey <sc...@richrelevance.com>
Subject Re: questions about sort-orders
Date Thu, 02 Dec 2010 18:09:56 GMT

On Dec 2, 2010, at 7:30 AM, David Jeske wrote:

I like the inclusion of sort-order in avro, to enable different machines to sort and exchange.
I have a few suggestions to clarify the documentation. Please correct any assumptions I've
made that are incorrect...

It seems that sorts are not stable across schema versions. I think I understand why this makes
sense inside the schema philosophy, yet I think the documentation could clear up a couple
of the subtleties a bit more. For example, it says "data items may only be compared if they
have identical schemas". If I supply a source schema which avro can map into my target schema,
I would think it could load and compare things in my target schema. Is this correct? It might
be clarified.

There is some need for clarification.  As I understand it, things are sorted in the order
of the reader's schema, but I may be wrong.  If the schema changes, the sort order can change.
There is no getting around that.  Usually as a schema evolves some things that were formerly
different become equal, and some things that were equal become different.  Typically, the
new schema's definition of order and equivalence is all that matters, so a sort will be consistent,
but unstable, with respect to the new schema.  But some schema changes will break that (such
as changing a field from ascending to descending order, or changing the order that fields
are compared).
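
To make the two breaking changes above concrete, here is a minimal sketch in plain Python. It does not use the Avro library: records are dicts, and a "schema" is reduced to a hypothetical list of (field name, order) pairs that mimics the record-field "order" attribute (ascending, descending, ignore). Fields are compared in the order they are listed, as the sort-order rules for records describe.

```python
from functools import cmp_to_key

def compare_records(a, b, fields):
    """Compare two dict records field by field, in the listed order.

    `fields` is a hypothetical stand-in for a record schema: a list of
    (name, order) pairs with order in {"ascending", "descending", "ignore"}.
    Returns -1, 0, or 1.
    """
    for name, order in fields:
        if order == "ignore":
            continue
        x, y = a[name], b[name]
        c = (x > y) - (x < y)
        if c != 0:
            return -c if order == "descending" else c
    return 0

def sort_records(records, fields):
    return sorted(records, key=cmp_to_key(lambda a, b: compare_records(a, b, fields)))

records = [
    {"id": 2, "name": "a"},
    {"id": 1, "name": "c"},
    {"id": 3, "name": "b"},
]

old_schema = [("id", "ascending"), ("name", "ascending")]
flipped    = [("id", "descending"), ("name", "ascending")]  # order attribute changed
reordered  = [("name", "ascending"), ("id", "ascending")]   # field order changed

by_old = [r["id"] for r in sort_records(records, old_schema)]
by_flipped = [r["id"] for r in sort_records(records, flipped)]
by_reordered = [r["id"] for r in sort_records(records, reordered)]

print(by_old)        # [1, 2, 3]
print(by_flipped)    # [3, 2, 1]
print(by_reordered)  # [2, 3, 1]
```

The same three records come out in three different orders, so data written in sorted order under the first definition cannot be assumed sorted under either of the other two.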

Also, the comment "this permits data written by one system to be efficiently sorted by another
system", could call out that data items sorted in one schema may not be in the proper order
if during read they are mapped to a new version of the schema. In fact, it might be useful
for Avro to be able to tell me when it does the source->target schema mapping, whether
both schemas sorted in the same order (if it doesn't already).

It would be useful to provide whether the reader/writer schema resolution altered the sort
order or not.  I don't think we do this. The answer to that question is not as simple as a
yes/no answer however.  The sort order when migrated from an old schema to a new one may change
completely, or it may remain consistent but be unstable from the POV of the new schema, or
be both consistent and stable with respect to sorts using the prior schema.
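
One possible reading of those three outcomes can be sketched as a pairwise check over sample records. This is only an illustration in plain Python, not something Avro provides: the `classify` helper, the (name, order) field lists, and the exact definitions used here ("changed" when some pair is strictly reversed, "consistent but unstable" when no pair is reversed but some pair is tied under one definition and ordered under the other, "consistent and stable" when every pair agrees) are assumptions made for the example. Because it only inspects the samples it is given, it can demonstrate a difference but cannot prove that two schemas always agree.

```python
from itertools import combinations

def compare_records(a, b, fields):
    """Field-by-field comparison; `fields` is a list of (name, order) pairs."""
    for name, order in fields:
        if order == "ignore":
            continue
        x, y = a[name], b[name]
        c = (x > y) - (x < y)
        if c != 0:
            return -c if order == "descending" else c
    return 0

def classify(samples, old_fields, new_fields):
    """Hypothetical helper: compare two orderings pairwise on sample data."""
    tie_mismatch = False
    for a, b in combinations(samples, 2):
        old = compare_records(a, b, old_fields)
        new = compare_records(a, b, new_fields)
        if old * new < 0:
            return "changed"      # some pair is strictly reversed
        if old != new:
            tie_mismatch = True   # tied under one ordering, ordered under the other
    return "consistent but unstable" if tie_mismatch else "consistent and stable"

samples = [
    {"id": 1, "name": "a"},
    {"id": 1, "name": "b"},
    {"id": 2, "name": "a"},
]

base        = [("id", "ascending"), ("name", "ascending")]
ignore_name = [("id", "ascending"), ("name", "ignore")]
descending  = [("id", "descending"), ("name", "ascending")]

print(classify(samples, base, base))         # consistent and stable
print(classify(samples, base, ignore_name))  # consistent but unstable
print(classify(samples, base, descending))   # changed
```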

Lastly, it says "Note also that Avro binary-encoded data can be efficiently ordered without
deserializing it to objects." What does this mean exactly?  This might be misinterpreted
as saying one can lexicographically sort the binary-encoding without asking Avro to deserialize
it, and it'll be in a proper order. However, this seems obviously not true from the number
formats. Perhaps it would be clearer to say "Avro can efficiently make sort-comparisons on
binary-encoded data without allocating deserialization objects."

Did I properly understand those sort-related subtleties?

Yes, perhaps we should say "Avro can efficiently make sort-comparisons on binary data without
full deserialization" or something similar.
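
The number-format point can be checked with a small sketch. Avro writes int and long values with variable-length zig-zag coding; the plain-Python encoder and reader below are written for this example (the names `encode_long`, `read_long`, and `compare_binary_longs` are hypothetical, not Avro API). The sketch shows that sorting the raw bytes lexicographically does not give numeric order, while a comparator that reads only the primitive it needs from each buffer does, without building any record objects.

```python
def encode_long(n: int) -> bytes:
    """Zig-zag then base-128 varint encode a signed 64-bit value."""
    z = (n << 1) ^ (n >> 63)          # zig-zag: small magnitudes get small codes
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80) # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)

def read_long(buf: bytes, pos: int = 0):
    """Decode one value starting at `pos`; return (value, next position)."""
    shift = 0
    z = 0
    while True:
        b = buf[pos]
        pos += 1
        z |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1), pos

def compare_binary_longs(a: bytes, b: bytes) -> int:
    """Compare two encoded longs by decoding just the primitive from each."""
    x, _ = read_long(a)
    y, _ = read_long(b)
    return (x > y) - (x < y)

values = [-2, -1, 0, 1, 2, 64]
encoded = [encode_long(v) for v in values]

# Sorting the raw bytes lexicographically scrambles the numeric order.
lexicographic = [read_long(e)[0] for e in sorted(encoded)]
print(lexicographic)  # [0, -1, 1, -2, 2, 64]

# Sorting with the decoding comparator restores it.
numeric = [read_long(e)[0] for e in sorted(encoded, key=lambda e: read_long(e)[0])]
print(numeric)        # [-2, -1, 0, 1, 2, 64]
```

A comparator for whole records would walk both buffers in step, field by field in schema order, and stop at the first field that differs.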
