avro-dev mailing list archives

From "Stu Hood (JIRA)" <j...@apache.org>
Subject [jira] Commented: (AVRO-679) Improved encodings for arrays
Date Mon, 11 Oct 2010 16:21:33 GMT

    [ https://issues.apache.org/jira/browse/AVRO-679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12919886#action_12919886 ]

Stu Hood commented on AVRO-679:

> Adding a new fundamental type or encoding is hard to do compatibly.
Agreed, but this particular optimization is only possible with support in Avro itself, and
it opens up a lot of other interesting possibilities. For instance, in your prefix encoding
example, encoding a block of <int,string,long> records column-wise, as a record
{{array<int>, array<long>, array<string>}}, might give a 3-6x increase in decode speed
(based on the numbers suggested in the link).
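
A minimal sketch of the column-wise layout described above, in Python. This is not Avro
API: the names rows_to_columns and columns_to_rows and the column keys are hypothetical,
and the sketch only shows the transposition of a block, not the byte-level encoding.

```python
from typing import Dict, List, Tuple

# One logical row of the <int,string,long> example.
Row = Tuple[int, str, int]


def rows_to_columns(rows: List[Row]) -> Dict[str, list]:
    """Transpose a block of rows into one array per field (hypothetical helper)."""
    return {
        "ints": [r[0] for r in rows],
        "longs": [r[2] for r in rows],
        "strings": [r[1] for r in rows],
    }


def columns_to_rows(cols: Dict[str, list]) -> List[Row]:
    """Rebuild the original rows from the per-field arrays."""
    return list(zip(cols["ints"], cols["strings"], cols["longs"]))


block = [(1, "a", 10), (2, "b", 20), (3, "c", 30)]
columns = rows_to_columns(block)
restored = columns_to_rows(columns)
```

With each field stored contiguously, every array can be given an encoding suited to its
type, which is where any decode-speed gain would have to come from.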

It is also worth considering how the specification can evolve in a backwards compatible
way: perhaps the next revision of the specification could require a 'spec revision' number
to be present in all schemas, and a schema that is missing the revision number would be
assumed to be in the legacy format. This would allow readers and writers to communicate
across spec revision boundaries by disabling optimizations/encodings that the other side
does not support.
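
A rough sketch of how that negotiation could work, in Python. Everything named here is
an assumption for illustration: the "spec_revision" schema attribute, the value 0 for
legacy schemas, and the "group_varint_arrays" encoding name are hypothetical and not
part of the Avro specification.

```python
from typing import Dict, Set

# Assumption: a schema without a revision number is treated as legacy (revision 0).
LEGACY_REVISION = 0

# Assumption: hypothetical optional encodings and the first revision that allows them.
ENCODING_MIN_REVISION: Dict[str, int] = {
    "group_varint_arrays": 1,
}


def spec_revision(schema: dict) -> int:
    """Return the schema's declared revision, or the legacy revision if absent."""
    return schema.get("spec_revision", LEGACY_REVISION)


def enabled_encodings(writer_schema: dict, reader_schema: dict) -> Set[str]:
    """Keep only the encodings that both sides' revisions support."""
    common = min(spec_revision(writer_schema), spec_revision(reader_schema))
    return {
        name
        for name, min_rev in ENCODING_MIN_REVISION.items()
        if common >= min_rev
    }


legacy = {"type": "array", "items": "int"}
modern = {"type": "array", "items": "int", "spec_revision": 1}
```

Taking the minimum of the two revisions means a newer writer falls back to the legacy
encoding whenever the reader's schema carries no revision number.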

> One might automatically rewrite schemas and have a layer that transforms datastructures
Yes: there is probably room for a schema translation layer above Avro for things like RLE
/ prefix encoding, but I think it is a separate area of focus.

> Improved encodings for arrays
> -----------------------------
>                 Key: AVRO-679
>                 URL: https://issues.apache.org/jira/browse/AVRO-679
>             Project: Avro
>          Issue Type: New Feature
>          Components: spec
>            Reporter: Stu Hood
>            Priority: Minor
> There are better ways to encode arrays of varints [1] which are faster to decode, and
> more space efficient than encoding varints independently.
> Extending the idea to other types of variable length data like 'bytes' and 'string',
> you could encode the entries for an array block as an array of lengths, followed by contiguous
> byte/utf8 data.
> [1] group varint encoding: slides 57-63 of http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/people/jeff/WSDM09-keynote.pdf
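
A minimal Python sketch of the "array of lengths, followed by contiguous byte/utf8 data"
layout from the description above. The function names are hypothetical, and the lengths
are written as plain unsigned base-128 varints for simplicity, which is an assumption
and differs from Avro's zig-zag long encoding. It does not implement group varint.

```python
from typing import List, Tuple


def write_varint(n: int, out: bytearray) -> None:
    """Append a non-negative int as an unsigned base-128 varint."""
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)


def read_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Read an unsigned varint at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not (b & 0x80):
            return result, pos
        shift += 7


def encode_string_block(strings: List[str]) -> bytes:
    """Layout: item count, then every length, then all UTF-8 data back to back."""
    payloads = [s.encode("utf-8") for s in strings]
    out = bytearray()
    write_varint(len(payloads), out)
    for p in payloads:
        write_varint(len(p), out)
    for p in payloads:
        out += p
    return bytes(out)


def decode_string_block(buf: bytes) -> List[str]:
    """Read the count and the lengths first, then slice the contiguous data."""
    count, pos = read_varint(buf, 0)
    lengths = []
    for _ in range(count):
        n, pos = read_varint(buf, pos)
        lengths.append(n)
    strings = []
    for n in lengths:
        strings.append(buf[pos:pos + n].decode("utf-8"))
        pos += n
    return strings
```

Because all lengths sit together at the front of the block, they could themselves be
packed with a scheme like group varint, while the string data is read in one pass.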

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
