avro-dev mailing list archives

From "John Plevyak (JIRA)" <j...@apache.org>
Subject [jira] Commented: (AVRO-519) Efficient sparse optional fields support
Date Thu, 22 Apr 2010 16:41:51 GMT

    [ https://issues.apache.org/jira/browse/AVRO-519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859878#action_12859878 ]

John Plevyak commented on AVRO-519:
-----------------------------------

Doug, your proposed solution is made somewhat more complex by the fact that it is not
possible to associate a name with types other than records, fixed and enum within a union.
One might want to do:

{
    "type" : "array",
    "name" : "optionals",
    "items" : [
       { "name" : "a", "type" : "bytes" },
       { "name" : "b", "type" : "bytes" }
    ]
}

which the C++ translator accepts but for which it nevertheless generates incorrect code
(I will file a bug).

As it stands, one would have to do:

{
    "type" : "array",
    "name" : "optionals",
    "items" : [
       { "name" : "l", "type" : "record", "fields" : [ { "name" : "l", "type": "long"} ] },
       { "name" : "r", "type" : "record", "fields" : [ { "name" : "r", "type": "long"} ] }
    ]
}

which is workable, albeit more complicated than one might want.  What is the rationale for
not permitting a name to be associated with other types in a union?
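
For illustration only, the wrapper-record workaround above can be generated mechanically.
This is a minimal Python sketch, not part of Avro: the helper name wrap_in_records is
hypothetical, it only builds the JSON shown above (including the "name" attribute on the
array, copied as written), and it does not validate the result against any Avro
implementation.

```python
import json

def wrap_in_records(named_types):
    """Hypothetical helper: wrap each (name, type) pair in a single-field record
    so that every branch of the union carries a name."""
    items = [
        {
            "name": name,
            "type": "record",
            "fields": [{"name": name, "type": avro_type}],
        }
        for name, avro_type in named_types
    ]
    # Mirrors the schema layout used in the message above.
    return {"type": "array", "name": "optionals", "items": items}

schema = wrap_in_records([("l", "long"), ("r", "long")])
print(json.dumps(schema, indent=4))
```

Each value in the array then has to be written as a one-field record rather than a bare
long, which is the extra complication mentioned above.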

> Efficient sparse optional fields support
> ----------------------------------------
>
>                 Key: AVRO-519
>                 URL: https://issues.apache.org/jira/browse/AVRO-519
>             Project: Avro
>          Issue Type: New Feature
>          Components: spec
>            Reporter: John Plevyak
>
> One of the nice features of protobuf is efficient support for very sparse optional fields,
> for example a large number of tags potentially associated with a document, the vast
> majority of which are empty.
> Avro does support optional fields as part of differing specifications, but not on a per-record
> level after a protocol has been agreed upon.  Avro does have support for arrays and maps;
> however, both of these require homogeneous types.
> I would suggest adding an additional field attribute:
>    * "optional" - with values "true"/"false" (where "false" is assumed)
> For the encoding I would suggest that any record which includes optional fields
> would be prefixed by a presence map which would be a sequence of int8 x* where:
>   x > 0 : the lower 7 bits are presence bits for the next 7 optional fields (low bit first)
>   -128 < x < 0 : the next present field is position x + 135 (as x runs from -1 to -127
>               and the first 7 must be empty otherwise we would use the x > 0 encoding)
>   x == -128: no optional fields present in the next 134 optional fields
>   x = 0 : end of sequence
>   further, if the map has covered all the options, the end-of-sequence marker can be
>   elided.  For example, a type with 3 optional fields would require only a single byte.

> This will permit encoding at 8/7 of a bit per present entry (worst case) and at a cost of
> 8/134 (0.06) bits/entry per all but last not-present (7.5 bytes / 1000 optional fields).
> This encoding is backward compatible as well, as schemas which do not contain optional
> elements do not have the presence map and the encoding is therefore identical.  Backward
> compatibility can be maintained by simply using the default value for not-present fields.
> Language APIs:
> Efficient support could include either an explicit presence test or a function which
> returns the value or default value (if the field is not present).
>  
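
To make the presence map in the quoted description concrete, here is a minimal Python
sketch of an encoder and decoder. It is one reading of the proposal, not an implementation
from Avro. The function names are hypothetical, and two details are assumptions the
description does not spell out: positions are counted 1-based from the current field, and
after a negative byte the map continues with the field just past the one it located.

```python
def encode_presence_map(present):
    """Encode a list of booleans (one per optional field) as signed int8 values."""
    out = []
    n = len(present)
    cursor = 0
    while cursor < n:
        # Index of the next present field at or after the cursor, if any.
        nxt = next((i for i in range(cursor, n) if present[i]), None)
        if nxt is None:
            out.append(0)                # x == 0: end of sequence
            break
        gap = nxt - cursor               # empty fields before the next present one
        if gap < 7:
            bits = 0
            for b in range(7):           # x > 0: 7 presence bits, low bit first
                if cursor + b < n and present[cursor + b]:
                    bits |= 1 << b
            out.append(bits)
            cursor += 7
        elif gap < 134:
            out.append((gap + 1) - 135)  # -128 < x < 0: position = x + 135
            cursor = nxt + 1             # assumption: resume just past that field
        else:
            out.append(-128)             # x == -128: next 134 fields all empty
            cursor += 134
    # If the loop ran off the end, the end-of-sequence marker is elided.
    return out


def decode_presence_map(data, n):
    """Decode int8 values into (presence list for n optional fields, bytes consumed)."""
    present = [False] * n
    cursor = 0
    used = 0
    while cursor < n:
        x = data[used]
        used += 1
        if x == 0:
            break
        if x > 0:
            for b in range(7):
                if cursor + b < n and (x >> b) & 1:
                    present[cursor + b] = True
            cursor += 7
        elif x == -128:
            cursor += 134
        else:
            position = x + 135           # 1-based offset of the present field
            present[cursor + position - 1] = True
            cursor += position
    return present, used


print(encode_presence_map([True, False, True]))
print(len(encode_presence_map([False] * 999 + [True])))
```

Under these assumptions a type with 3 optional fields encodes to a single byte, and 1000
optional fields with only the last one present take 8 bytes, in line with the roughly
7.5 bytes per 1000 fields estimated in the description.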

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

