avro-user mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: what's the efficiency difference between type: "string" and ["string", "null"]
Date Fri, 14 Mar 2014 17:49:15 GMT
One small note: the best practice is to place "null" first when it's
in a union.  This is because the type of a default value for a union
field is the type of the first element of the union, and null is the
most commonly used default value for unions with null.  So the idiom
for a field that defaults to null is:

{"name":<<field name>>,"type":["null",<<field type>>],"default":null}
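
As a concrete instance of that idiom, here is a minimal sketch that builds such a schema with Python's standard json module only. The record name "Example" and the field name "nickname" are hypothetical, chosen purely for illustration.

```python
import json

# Hypothetical record; only the layout of the nullable field matters here.
schema = {
    "type": "record",
    "name": "Example",
    "fields": [
        # "null" is listed first so that the default value (null)
        # matches the first branch of the union.
        {"name": "nickname", "type": ["null", "string"], "default": None},
    ],
}

# Python's None is serialized as JSON null.
schema_json = json.dumps(schema)
print(schema_json)
```

Writing the union as ["string", "null"] with "default": null would not fit the rule described above, since the default would then have to be a string.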

I've updated the specification to clarify this point.

https://issues.apache.org/jira/browse/AVRO-1482

Doug

On Fri, Mar 14, 2014 at 1:56 AM, Bertrand Dechoux <dechouxb@gmail.com> wrote:
> I think the specification is clear about that.
>
>> Unions
>> A union is encoded by first writing a long value indicating the zero-based
>> position within the union of the schema of its value. The value is then
>> encoded per the indicated schema within the union.
>> For example, the union schema ["string","null"] would encode:
>>
>> null as the integer 1 (the index of "null" in the union, encoded as hex
>> 02):
>>
>> 02
>>
>> the string "a" as zero (the index of "string" in the union), followed by
>> the serialized string:
>>
>> 00 02 61
>
>
> http://avro.apache.org/docs/1.7.6/spec.html
>
> So there is an overhead but that may not be the main issue.
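
To make the size of that overhead concrete, the following is a rough sketch of the union encoding quoted above, written in plain Python without any Avro library. The helper names (zigzag_varint, encode_string, encode_union) are made up for this example, and the union handling is simplified to None and str values only.

```python
def zigzag_varint(n: int) -> bytes:
    """Encode an integer the way Avro encodes a long: zig-zag, then base-128."""
    z = (n << 1) ^ (n >> 63)  # zig-zag: 0, -1, 1, -2 ... become 0, 1, 2, 3 ...
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_string(s: str) -> bytes:
    """A string is its UTF-8 byte length (as a long) followed by the bytes."""
    data = s.encode("utf-8")
    return zigzag_varint(len(data)) + data


def encode_union(branches, value) -> bytes:
    """Simplified: handles only None and str against a two-branch union."""
    branch = "null" if value is None else "string"
    index = zigzag_varint(branches.index(branch))  # zero-based position
    # null itself is written as zero bytes, so only the index remains.
    return index if value is None else index + encode_string(value)


print(encode_union(["string", "null"], None).hex())  # 02
print(encode_union(["string", "null"], "a").hex())   # 000261
print(encode_string("a").hex())                      # 0261
```

Under these assumptions a nullable string costs one extra byte per value compared with a plain "string" (the branch index), so making all 50 string fields nullable would add roughly 50 bytes per record.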
>
> The issue might be more about defining a correct schema. If a field can be
> null then all clients should handle the case when the field is indeed null.
> That's a 'hygiene issue' (or data quality issue if you prefer), like with a
> database schema.
>
> Regards
>
> Bertrand
>
> Bertrand Dechoux
>
>
> On Fri, Mar 14, 2014 at 9:15 AM, Fengyun RAO <raofengyun@gmail.com> wrote:
>>
>> I have some string fields which may be null, while others are definitely
>> not null.
>> The problem is that it takes time to distinguish them.
>> There are about 100 fields, 50 of which are strings, 10 of which I guess
>> could be null.
>>
>> Could I just declare every string field as ["string", "null"]?
>> How large is the efficiency difference?
>>
>>
>
