avro-dev mailing list archives

From "Justin SB (JIRA)" <j...@apache.org>
Subject [jira] Commented: (AVRO-196) Add encoding for sparse records
Date Tue, 17 Nov 2009 01:17:39 GMT

    [ https://issues.apache.org/jira/browse/AVRO-196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778676#action_12778676 ]

Justin SB commented on AVRO-196:
--------------------------------

You're probably right that this is too big a change for Avro in its early stages.  In my use
case I was storing floats, but I've switched to storing ints instead, so an empty value now
costs 7 extra bits (one byte rather than a single presence bit) instead of 31.
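
For anyone checking the arithmetic, here is a rough sketch of where 7 and 31 come from. It assumes the standard Avro binary encodings (zigzag varint for ints, 4 little-endian bytes for floats) and a sparse format that spends exactly one presence bit per field; the function names are made up for illustration and are not Avro's API.

```python
import struct


def encode_int(n: int) -> bytes:
    """Zigzag + base-128 varint, as used for Avro int/long (illustrative helper)."""
    z = (n << 1) ^ (n >> 63)  # zigzag: small magnitudes map to small codes
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_float(x: float) -> bytes:
    """Avro float: 4 bytes, little-endian IEEE 754 single precision."""
    return struct.pack("<f", x)


PRESENCE_BIT = 1  # assumed cost of an absent field in a sparse encoding

int_bits = 8 * len(encode_int(0))        # an "empty" int is a single zero byte
float_bits = 8 * len(encode_float(0.0))  # an "empty" float is always 4 bytes

print(int_bits - PRESENCE_BIT, float_bits - PRESENCE_BIT)  # prints "7 31"
```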

Perhaps we should see what can be achieved through compression first (AVRO-135).  I'd like
to see a per-record compression option, and I'd also like to have empty values compress well.
I think as long as we choose an algorithm where runs of consecutive zeroes are highly
compressed, compression would solve the issue here, while also being more generally applicable.
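
As a quick illustration of the "consecutive zeroes" point, the sketch below uses zlib (DEFLATE) purely as a stand-in; AVRO-135 does not commit to an algorithm here, and the 64-field all-empty record is a made-up example in which every field is an int 0 encoded as one zero byte.

```python
import zlib

# Hypothetical record: 64 int fields, all empty, each encoded as one 0x00 byte.
empty_record = bytes(64)

# Per-record compression: each record is compressed on its own.
per_record = zlib.compress(empty_record)

# Block compression: 100 such records compressed together.
batch = empty_record * 100
per_batch = zlib.compress(batch)

print(len(empty_record), "->", len(per_record), "bytes for one record")
print(len(batch), "->", len(per_batch), "bytes for 100 records")
```

Per-record compression pays the codec's fixed header and checksum on every record, so the gain on a single small record is much smaller than on a batch; that trade-off is worth measuring before choosing between the two.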

> Add encoding for sparse records
> -------------------------------
>
>                 Key: AVRO-196
>                 URL: https://issues.apache.org/jira/browse/AVRO-196
>             Project: Avro
>          Issue Type: New Feature
>          Components: java
>            Reporter: Justin SB
>            Priority: Minor
>
> If we have a large record with many fields in Avro which is mostly empty, currently Avro
> will still serialize every field, leading to big overhead.  We could support a sparse record
> format for this case: before each record a bitmask is serialized indicating the presence of
> the fields.  We could specify the encoding type as a new attribute in the avpr e.g.
> {"type":"record", "name":"Test", "encoding":"sparse", "fields":....}
> I've put an implementation of the idea on github:
> http://github.com/justinsb/avro/commit/7f6ad2532298127fcdd9f52ce90df21ff527f9d1
> This leads to big improvements in the serialization size in our case, when we're using
> Avro to serialize performance metrics, where most of the fields are usually empty.
> The alternative of using a Map isn't a good idea because it (1) serializes the names
> of the fields and (2) means we lose strong typing.
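
The bitmask idea in the description can be sketched roughly as follows. This is not the layout used in the linked commit (which is not reproduced here); the function names, the bit order within the mask, and the restriction to int fields are all assumptions made to keep the example short.

```python
from typing import List, Optional, Tuple


def encode_int(n: int) -> bytes:
    """Zigzag + varint, matching Avro's int/long encoding (illustrative helper)."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def decode_int(buf: bytes, pos: int) -> Tuple[int, int]:
    """Read one zigzag varint starting at pos; return (value, next position)."""
    z, shift = 0, 0
    while True:
        b = buf[pos]
        pos += 1
        z |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1), pos


def encode_sparse(values: List[Optional[int]]) -> bytes:
    """Presence bitmask (one bit per field, LSB first), then only the present fields."""
    mask = bytearray((len(values) + 7) // 8)
    body = bytearray()
    for i, v in enumerate(values):
        if v is not None:
            mask[i // 8] |= 1 << (i % 8)
            body += encode_int(v)
    return bytes(mask) + bytes(body)


def decode_sparse(buf: bytes, n_fields: int) -> List[Optional[int]]:
    """Inverse of encode_sparse; the field count comes from the schema."""
    pos = (n_fields + 7) // 8  # field data starts right after the mask
    values: List[Optional[int]] = []
    for i in range(n_fields):
        if buf[i // 8] & (1 << (i % 8)):
            v, pos = decode_int(buf, pos)
            values.append(v)
        else:
            values.append(None)
    return values


# A 16-field metrics record with a single populated field.
record: List[Optional[int]] = [None] * 16
record[3] = 42
encoded = encode_sparse(record)
print(len(encoded))  # prints 3: two mask bytes plus one byte for the value 42
```

Because field order and types still come from the schema, no field names are written and typing is preserved, which is the advantage over a Map noted in the description.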

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.