avro-dev mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: spec changes & schedule
Date Fri, 24 Apr 2009 23:32:01 GMT
Scott Carey wrote:
> In particular, a large number of use cases are only positive
> integers, or expect only very small negative numbers to exist.  Is a
> positive number biased serialization of much use, or is it more
> trouble than it's worth?

I did spend a few days trying to define a positive-biased integer 
format that was sufficiently simple to describe and efficient to 
implement, and was unable to.  That's not to say it's impossible.
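
To make the size question concrete, here is a minimal sketch in Python. It 
assumes the variable-length zig-zag coding that the Avro specification 
describes for int and long, and compares it with a plain unsigned varint. 
The unsigned variant and all function names are hypothetical, used only for 
comparison; this is not a proposed Avro encoding.

```python
def zigzag(n):
    """Map a signed 64-bit integer to an unsigned one: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    return (n << 1) ^ (n >> 63)


def varint(u):
    """Encode a non-negative integer 7 bits at a time, low-order group first.

    The high bit of each byte is set when more bytes follow.
    """
    out = bytearray()
    while u > 0x7F:
        out.append((u & 0x7F) | 0x80)
        u >>= 7
    out.append(u)
    return bytes(out)


def encode_zigzag(n):
    """Signed encoding: zig-zag first, then varint."""
    return varint(zigzag(n))


def encode_unsigned(n):
    """Hypothetical positive-only encoding: varint with no zig-zag step."""
    if n < 0:
        raise ValueError("unsigned encoding cannot represent negative values")
    return varint(n)


if __name__ == "__main__":
    for value in (1, 63, 64, 127, 8191, 8192):
        print(value, len(encode_zigzag(value)), len(encode_unsigned(value)))
```

Zig-zag spends one bit on the sign, so a positive-only value needs an extra 
byte only when it falls in the upper half of a varint size class (64 to 127, 
8192 to 16383, and so on). A biased format would have to recover that bit 
while still representing the occasional negative number, which is where the 
complexity comes in.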

> In general, having a couple serialization types for other currently
> fixed-only size types might be useful as well.  "encoding":
> "bias-unsigned"? Given the knowledge of the data stored I can see
> some variation in what a user would want for float, double, string,
> datetime, or int WRT space / time tradeoffs.

We need to strike a reasonable balance between simplicity and maximal 
performance for every case.  When things are unclear, I tend to opt for 
simplicity over a potential minor performance improvement.  A biased 
representation might save a few percent in size for some applications, 
but at the cost of forcing every implementation in every language to 
support that encoding.  Avro's about interchange between applications, 
and one might need to make some compromises over what's ideal for each 

> I don't think that a first version of Avro should do much work on
> this front.  But I would like it to not preclude future options or
> extensibility on the encoding side of things per data type. In the
> distant future, Strings could even have an encoding type that
> compresses by spanning across records of the same column in stream
> formats -- (which is often much more efficient than compressing the
> whole row since column similarity - think URLs - can be very high).

Such optimizations can be implemented in other ways too, complementary 
to Avro.  For example, a container might store a sequence of Avro 
records that contain <int,string> pairs, the int indicating how much of 
the previous record's string is a prefix of the current, and the string 
providing the suffix.  The container's API could then decode these to 
the full string.
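
A rough sketch of that container-level idea, in Python. The function names 
are hypothetical and the pairs are plain tuples rather than Avro records; in 
practice each pair would be serialized as an Avro record with an int field 
and a string field.

```python
def shared_prefix_len(a, b):
    """Return the number of leading characters that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def encode_strings(strings):
    """Turn a sequence of strings into (prefix_length, suffix) pairs.

    prefix_length says how much of the previous string is reused.
    """
    pairs = []
    prev = ""
    for s in strings:
        n = shared_prefix_len(prev, s)
        pairs.append((n, s[n:]))
        prev = s
    return pairs


def decode_strings(pairs):
    """Rebuild the full strings from (prefix_length, suffix) pairs."""
    strings = []
    prev = ""
    for n, suffix in pairs:
        s = prev[:n] + suffix
        strings.append(s)
        prev = s
    return strings


if __name__ == "__main__":
    urls = [
        "http://apache.org/",
        "http://apache.org/avro",
        "http://apache.org/hadoop",
    ]
    pairs = encode_strings(urls)
    print(pairs)
    assert decode_strings(pairs) == urls
```

The savings depend on adjacent strings sharing long prefixes, so this works 
best when the container stores records in sorted or otherwise clustered 
order. Decoding is sequential, since each string depends on the one before it.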

