avro-dev mailing list archives

From "Philip Zeyliger (JIRA)" <j...@apache.org>
Subject [jira] Commented: (AVRO-196) Add encoding for sparse records
Date Fri, 13 Nov 2009 20:16:39 GMT

    [ https://issues.apache.org/jira/browse/AVRO-196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777633#action_12777633 ]

Philip Zeyliger commented on AVRO-196:

I'm thinking about ways to approach this without changing the serialized form so much.
There are two ways to model this:
* For fields f1 through f10, create records r1 through r10.  Then create
u=union(r1, ..., r10) and store list(u).  The union costs you one byte per item (I think),
so it's more expensive than the bitset.
* You could also just implement the sparseness manually.  You store "bytes" first (the
bitset), followed by list(int) holding only the values that are present.  When you
deserialize in your application, you interpret the bitset manually.  (BTW, check out
Java's BitSet class.)
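
To make the second option concrete, here is a minimal sketch in plain Java, assuming all
fields are nullable ints.  The names SparseFieldsSketch, Packed, pack and unpack are
hypothetical, and nothing here calls the Avro API; it only shows the application-side
logic that would sit in a wrapper class.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical wrapper logic; not part of Avro.
final class SparseFieldsSketch {

    /** The two values to store: a presence bitmask and the non-null values in field order. */
    static final class Packed {
        final byte[] presence;       // assumed to map to an Avro "bytes" field
        final List<Integer> values;  // assumed to map to an Avro array of int

        Packed(byte[] presence, List<Integer> values) {
            this.presence = presence;
            this.values = values;
        }
    }

    /** Packs a mostly-null array of fields into a bitmask plus the present values. */
    static Packed pack(Integer[] fields) {
        BitSet present = new BitSet(fields.length);
        List<Integer> values = new ArrayList<>();
        for (int i = 0; i < fields.length; i++) {
            if (fields[i] != null) {
                present.set(i);        // bit i set means field i has a value
                values.add(fields[i]);
            }
        }
        return new Packed(present.toByteArray(), values);
    }

    /** Rebuilds the full field array; absent fields come back as null. */
    static Integer[] unpack(Packed packed, int fieldCount) {
        BitSet present = BitSet.valueOf(packed.presence);
        Integer[] fields = new Integer[fieldCount];
        int next = 0;
        for (int i = present.nextSetBit(0); i >= 0 && i < fieldCount;
                 i = present.nextSetBit(i + 1)) {
            fields[i] = packed.values.get(next++);
        }
        return fields;
    }
}
```

With ten fields the bitmask needs at most two bytes, however many values are present.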

In general, there are often tighter data structures for specific applications, and Avro is
unlikely to support all of them.  For example, if you want to store a map with complex keys
(or even a sorted map), you have to store pairs and build the map in the application.  If
you're storing a time series, you might want to store only the deltas relative to the
previous values and reconstruct the values at run time.  I think having to do the
application-specific logic in wrapper classes is pretty normal.
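
The time-series case could look roughly like this in a wrapper class.  This is only a
sketch: DeltaSeriesSketch, toDeltas and fromDeltas are hypothetical names, and the first
element is stored as a delta from zero by assumption.

```java
// Hypothetical wrapper logic; not part of Avro.
final class DeltaSeriesSketch {

    /** Replaces each value with its difference from the previous one (the first from 0). */
    static long[] toDeltas(long[] values) {
        long[] deltas = new long[values.length];
        long previous = 0L;
        for (int i = 0; i < values.length; i++) {
            deltas[i] = values[i] - previous;
            previous = values[i];
        }
        return deltas;
    }

    /** Restores the original values by keeping a running sum of the deltas. */
    static long[] fromDeltas(long[] deltas) {
        long[] values = new long[deltas.length];
        long running = 0L;
        for (int i = 0; i < deltas.length; i++) {
            running += deltas[i];
            values[i] = running;
        }
        return values;
    }
}
```

Since Avro writes ints and longs with a variable-length zig-zag encoding, small deltas
should take fewer bytes than the raw values they replace.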

> Add encoding for sparse records
> -------------------------------
>                 Key: AVRO-196
>                 URL: https://issues.apache.org/jira/browse/AVRO-196
>             Project: Avro
>          Issue Type: New Feature
>          Components: java
>            Reporter: Justin SB
>            Priority: Minor
> If we have a large record with many fields in avro which is mostly empty, currently avro
> will still serialize every field, leading to big overhead.  We could support a sparse record
> format for this case: before each record a bitmask is serialized indicating the presence of
> the fields.  We could specify the encoding type as a new attribute in the avpr e.g.
> {"type":"record", "name":"Test", "encoding":"sparse", "fields":....}
> I've put an implementation of the idea on github:
> http://github.com/justinsb/avro/commit/7f6ad2532298127fcdd9f52ce90df21ff527f9d1
> This leads to big improvements in the serialization size in our case, when we're using
> avro to serialize performance metrics, where most of the fields are usually empty.
> The alternative of using a Map isn't a good idea because it (1) serializes the names
> of the fields and (2) means we lose strong typing.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
