avro-dev mailing list archives

From "Thiruvalluvan M. G. (JIRA)" <j...@apache.org>
Subject [jira] Commented: (AVRO-196) Add encoding for sparse records
Date Wed, 13 Jan 2010 13:36:54 GMT

    [ https://issues.apache.org/jira/browse/AVRO-196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799743#action_12799743 ]

Thiruvalluvan M. G. commented on AVRO-196:

I guess you are storing a sentinel value (say 0) to indicate that a value is absent. The
Avro idiom for an optional field of type T is the union [null, T]. If you go that way, each
absent field is encoded as null. Null itself takes no space in the Avro binary format, but
the union branch index takes 1 byte. This approach has two advantages over yours:
   - It is general and works for fields of any type. The cost is one byte irrespective
     of the size of the field.
   - You don't need a sentinel value.
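
For illustration, the [null, T] idiom might look like this in a record schema. This is a minimal sketch: the record name "Metrics" and the field name "latency_ms" are made up, and the schema is handled as plain JSON in Python rather than through any Avro library. Listing "null" first in the union is what allows the field to default to null.

```python
import json

# Hypothetical record with one optional field, written as the union ["null", "long"].
# "null" comes first so that the field's default value can be null.
SCHEMA_JSON = """
{
  "type": "record",
  "name": "Metrics",
  "fields": [
    {"name": "latency_ms", "type": ["null", "long"], "default": null}
  ]
}
"""

schema = json.loads(SCHEMA_JSON)
optional_field = schema["fields"][0]
print(optional_field["type"])
```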

The disadvantages are:
   - It is not quite as efficient as a bit field: it takes one byte per field rather than one bit.
   - The one-byte overhead applies whether the field is present or absent.
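
To make the trade-off concrete, here is a rough size comparison for a record of optional long fields. It is a sketch under stated assumptions: the zig-zag varint routine follows the Avro binary encoding for int/long, but `encode_as_unions` and `encode_with_bitmask` are hypothetical helpers written for this comparison, and the bitmask layout (one bit per field, packed into leading bytes) is only an assumed layout, not necessarily the one used in the patch attached to this issue.

```python
from typing import List, Optional


def zigzag_varint(n: int) -> bytes:
    """Encode a signed 64-bit integer as a zig-zag variable-length value."""
    z = (n << 1) ^ (n >> 63)  # zig-zag: small magnitudes map to small unsigned values
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_as_unions(values: List[Optional[int]]) -> bytes:
    """Encode each field as a [null, long] union: branch index, then the value."""
    out = bytearray()
    for v in values:
        if v is None:
            out += zigzag_varint(0)  # branch 0 = null, which has no payload
        else:
            out += zigzag_varint(1)  # branch 1 = long
            out += zigzag_varint(v)
    return bytes(out)


def encode_with_bitmask(values: List[Optional[int]]) -> bytes:
    """Assumed sparse layout: a presence bitmask followed by the present values."""
    mask = bytearray((len(values) + 7) // 8)
    payload = bytearray()
    for i, v in enumerate(values):
        if v is not None:
            mask[i // 8] |= 1 << (i % 8)
            payload += zigzag_varint(v)
    return bytes(mask) + bytes(payload)


# 16 optional fields, only one of which is present.
fields: List[Optional[int]] = [None] * 16
fields[3] = 42

union_size = len(encode_as_unions(fields))
bitmask_size = len(encode_with_bitmask(fields))
print(union_size, bitmask_size)
```

With 16 fields and one value present, the union form comes to 17 bytes (16 branch bytes plus one value byte) and the assumed bitmask form to 3 bytes (2 mask bytes plus one value byte).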

> Add encoding for sparse records
> -------------------------------
>                 Key: AVRO-196
>                 URL: https://issues.apache.org/jira/browse/AVRO-196
>             Project: Avro
>          Issue Type: New Feature
>          Components: java
>            Reporter: Justin SB
>            Priority: Minor
> If we have a large record with many fields in avro which is mostly empty, currently avro
> will still serialize every field, leading to big overhead.  We could support a sparse record
> format for this case: before each record a bitmask is serialized indicating the presence of
> the fields.  We could specify the encoding type as a new attribute in the avpr e.g.
> {"type":"record", "name":"Test", "encoding":"sparse", "fields":....}
> I've put an implementation of the idea on github:
> http://github.com/justinsb/avro/commit/7f6ad2532298127fcdd9f52ce90df21ff527f9d1
> This leads to big improvements in the serialization size in our case, when we're using
> avro to serialize performance metrics, where most of the fields are usually empty.
> The alternative of using a Map isn't a good idea because it (1) serializes the names
> of the fields and (2) means we lose strong typing.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
