Message-ID: <28213422.141501271954511274.JavaMail.jira@thor>
Date: Thu, 22 Apr 2010 12:41:51 -0400 (EDT)
From: "John Plevyak (JIRA)"
To: avro-dev@hadoop.apache.org
Subject: [jira] Commented: (AVRO-519) Efficient sparse optional fields support
In-Reply-To: <21775054.27121271528244817.JavaMail.jira@thor>

    [ https://issues.apache.org/jira/browse/AVRO-519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859878#action_12859878 ]

John Plevyak commented on AVRO-519:
-----------------------------------

Doug, your
proposed solution is made somewhat more complex by the fact that it is not
possible to associate a name with types other than records, fixed and enum
within a union. One might want to do:

    {
      "type" : "array",
      "name" : "optionals",
      "items" : [
        { "name" : "a", "type" : "bytes" },
        { "name" : "b", "type" : "bytes" }
      ]
    }

which the C++ translator accepts but for which it nevertheless generates
incorrect code (I will file a bug). As it stands, one would have to do:

    {
      "type" : "array",
      "name" : "optionals",
      "items" : [
        { "name" : "l", "type" : "record",
          "fields" : [ { "name" : "l", "type" : "long" } ] },
        { "name" : "r", "type" : "record",
          "fields" : [ { "name" : "r", "type" : "long" } ] }
      ]
    }

which is workable, albeit more complicated than one might want.

What is the rationale for not permitting a name to be associated with other
types in a union?

> Efficient sparse optional fields support
> ----------------------------------------
>
>                 Key: AVRO-519
>                 URL: https://issues.apache.org/jira/browse/AVRO-519
>             Project: Avro
>          Issue Type: New Feature
>          Components: spec
>            Reporter: John Plevyak
>
> One of the nice features of protobuf is efficient support for very sparse optional fields,
> for example a large number of tags potentially associated with a document, the vast
> majority of which are empty.
> Avro does support optional fields as part of differing specifications, but not on a per-record
> level after a protocol has been agreed upon. Avro does have support for arrays and maps;
> however, both of these require homogeneous types.
> I would suggest adding an additional field attribute:
>   * "optional" - with values "true"/"false" (where "false" is assumed)
> For the encoding I would suggest that any record which includes optional fields
> would be prefixed by a presence map which would be a sequence of int8 x* where:
>   x > 0 : the lower 7 bits are presence bits for the next 7 optional fields (low bit first)
>   -128 < x < 0 : the next present field is position x + 135 (as x runs from -1 to -127 and the first 7
>                  must be empty, otherwise we would use the x > 0 encoding)
>   x == -128 : no optional fields present in the next 134 optional fields
>   x == 0 : end of sequence
> Further, if the map has covered all the options, the end-of-sequence marker can be
> elided. For example, a type with 3 optional fields would require only a single byte.
> This will permit encoding at 8/7 of a bit per present entry (worst case) and at a cost of
> 8/134 (0.06) bits/entry per all but last not-present (7.5 bytes / 1000 optional fields).
> This encoding is backward compatible as well, as schemas which do not contain optional
> elements do not have the presence map and the encoding is therefore identical. Backward
> compatibility can be maintained by simply using the default value for not-present fields.
> Language APIs:
> Efficient support could include either an explicit presence test or a function which returns the value
> or default value (if the field is not present).
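
The presence map described in the quoted proposal can be sketched roughly as follows. This is an illustration only, not part of Avro or of any patch on AVRO-519: the function names `encode_presence` and `decode_presence` are hypothetical, and the sketch assumes that after a negative marker the map resumes at the field following the one it marked, which the proposal does not state explicitly.

```python
from typing import List

SKIP = 134  # fields skipped by the x == -128 marker (from the proposal)


def encode_presence(present: List[bool]) -> bytes:
    """Encode one presence flag per optional field as a sequence of int8 values."""
    out = []
    n = len(present)
    i = 0  # index of the next optional field not yet covered by the map
    while i < n:
        window = present[i:i + 7]
        if any(window):
            # x > 0: the low 7 bits are presence bits, low bit first.
            x = sum(1 << b for b, p in enumerate(window) if p)
            i += 7
        else:
            # The next 7 fields are empty; look for the next present field.
            j = next((k for k in range(i + 7, n) if present[k]), None)
            if j is None:
                out.append(0)  # x == 0: end of sequence, the rest is absent
                break
            pos = j - i + 1  # 1-based position of the next present field
            if pos <= SKIP:
                x = pos - 135  # -127..-1 encodes positions 8..134
                i = j + 1      # assumption: resume after the marked field
            else:
                x = -128       # none of the next 134 fields is present
                i += SKIP
        out.append(x & 0xFF)   # store the int8 as an unsigned byte
    return bytes(out)


def decode_presence(data: bytes, n: int) -> List[bool]:
    """Inverse of encode_presence for a record with n optional fields."""
    present = [False] * n
    i = 0
    stream = iter(data)
    while i < n:  # the end marker is elided once all fields are covered
        b = next(stream)
        x = b - 256 if b > 127 else b  # reinterpret the byte as int8
        if x == 0:
            break
        if x > 0:
            for bit in range(7):
                if (x >> bit) & 1:
                    present[i + bit] = True
            i += 7
        elif x == -128:
            i += SKIP
        else:
            pos = x + 135
            present[i + pos - 1] = True
            i += pos
    return present
```

Under these assumptions a record with 3 optional fields always takes a single map byte, and a record with 1000 optional fields of which only the last is present takes 8 bytes, in line with the roughly 7.5 bytes per 1000 fields estimated above.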