avro-dev mailing list archives

From "Justin SB (JIRA)" <j...@apache.org>
Subject [jira] Commented: (AVRO-196) Add encoding for sparse records
Date Tue, 17 Nov 2009 01:17:39 GMT

    [ https://issues.apache.org/jira/browse/AVRO-196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778676#action_12778676 ]

Justin SB commented on AVRO-196:

You're probably right that this is too big a change for Avro in its early stages.  In my use
case I was storing floats, but I've switched to storing ints instead, so an empty value now
costs 7 extra bits instead of 31 (one byte rather than four, compared with a single presence bit).
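
To make the sizes above concrete: in Avro's binary encoding a float is always 4 bytes, while
an int is zigzag-encoded as a variable-length integer, so a zero fits in one byte.  A minimal
Python sketch of both encodings (the helper names are illustrative and not part of any Avro
library):

```python
import struct


def encode_avro_int(n: int) -> bytes:
    """Zigzag + variable-length encoding, following the Avro spec for int/long."""
    z = (n << 1) ^ (n >> 63)  # zigzag: small magnitudes map to small unsigned values
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_avro_float(x: float) -> bytes:
    """Avro floats are 4-byte little-endian IEEE 754 values."""
    return struct.pack("<f", x)


if __name__ == "__main__":
    # Size in bits of an "empty" (zero) value under each type.
    print(len(encode_avro_int(0)) * 8, len(encode_avro_float(0.0)) * 8)
```

An empty int therefore takes 8 bits and an empty float 32, which is 7 and 31 bits more than
the single presence bit a sparse encoding would spend.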

Perhaps we should see what can be achieved through compression first (AVRO-135).  I'd like
to see a per-record compression option, and I'd also like to have empty values compress well.
I think as long as we choose an algorithm where consecutive zeroes are highly compressed,
compression would solve the issue here, while also being more generally applicable.
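
As a rough illustration of the compression idea, here is a small Python sketch that uses zlib
(DEFLATE) purely as a stand-in; nothing here says which algorithm AVRO-135 would adopt.  The
record layout is hypothetical: 200 int fields, each empty one encoded as a single zero byte.

```python
import zlib

# Hypothetical mostly-empty record: 200 int fields, all zero except three.
# Each value is small enough to occupy exactly one byte in Avro's int encoding.
FIELD_COUNT = 200
raw = bytearray(FIELD_COUNT)
raw[5] = 0x0A
raw[77] = 0x02
raw[150] = 0x54

# Compress this single record on its own (per-record compression).
compressed = zlib.compress(bytes(raw))

if __name__ == "__main__":
    print("raw bytes:", len(raw), "compressed bytes:", len(compressed))
```

Long runs of zero bytes shrink considerably, although a general-purpose compressor adds a
fixed per-record header, so the gain on small records may be smaller than this suggests.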

> Add encoding for sparse records
> -------------------------------
>                 Key: AVRO-196
>                 URL: https://issues.apache.org/jira/browse/AVRO-196
>             Project: Avro
>          Issue Type: New Feature
>          Components: java
>            Reporter: Justin SB
>            Priority: Minor
> If we have a large record with many fields in avro which is mostly empty, currently avro
> will still serialize every field, leading to big overhead.  We could support a sparse record
> format for this case: before each record a bitmask is serialized indicating the presence of
> the fields.  We could specify the encoding type as a new attribute in the avpr e.g.
> {"type":"record", "name":"Test", "encoding":"sparse", "fields":....}
> I've put an implementation of the idea on github:
> http://github.com/justinsb/avro/commit/7f6ad2532298127fcdd9f52ce90df21ff527f9d1
> This leads to big improvements in the serialization size in our case, when we're using
> avro to serialize performance metrics, where most of the fields are usually empty.
> The alternative of using a Map isn't a good idea because it (1) serializes the names
> of the fields and (2) means we lose strong typing.
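
The sparse format proposed in the issue description can be sketched roughly as follows.  This
is not the implementation from the GitHub commit linked above; the bit order within the mask,
the use of None for an empty field, and the restriction to int fields are all assumptions
made for illustration.

```python
from typing import List, Optional, Tuple


def encode_int(n: int) -> bytes:
    """Zigzag + variable-length encoding, as Avro uses for int/long."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def decode_int(data: bytes, pos: int) -> Tuple[int, int]:
    """Decode one zigzag varint starting at pos; return (value, next position)."""
    shift = 0
    z = 0
    while True:
        b = data[pos]
        pos += 1
        z |= (b & 0x7F) << shift
        if not (b & 0x80):
            break
        shift += 7
    return (z >> 1) ^ -(z & 1), pos


def encode_sparse(values: List[Optional[int]]) -> bytes:
    """Write a presence bitmask (assumed layout: 1 bit per field), then only present values."""
    mask = bytearray((len(values) + 7) // 8)
    body = bytearray()
    for i, v in enumerate(values):
        if v is not None:
            mask[i // 8] |= 1 << (i % 8)
            body += encode_int(v)
    return bytes(mask) + bytes(body)


def decode_sparse(data: bytes, field_count: int) -> List[Optional[int]]:
    """Inverse of encode_sparse; the field count comes from the schema."""
    pos = (field_count + 7) // 8
    values: List[Optional[int]] = []
    for i in range(field_count):
        if data[i // 8] & (1 << (i % 8)):
            v, pos = decode_int(data, pos)
            values.append(v)
        else:
            values.append(None)
    return values


if __name__ == "__main__":
    record: List[Optional[int]] = [None] * 16
    record[3] = 5
    record[10] = -1
    encoded = encode_sparse(record)
    print(len(encoded), decode_sparse(encoded, 16) == record)
```

With 16 fields of which two are set, the mask takes 2 bytes and the values 1 byte each, whereas
writing every field as a one-byte int would take 16 bytes.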

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
