avro-user mailing list archives

From Yibing Shi <y...@cloudera.com>
Subject Re: setting default values in avro
Date Fri, 08 Jul 2016 12:51:10 GMT
+ Sean Busbey

My understanding is that this problem is a limitation of the Python Avro
library. Currently it seems that the only valid default value is "null".
Please try the schema below to see whether it works for you.

{
    "type" : "record",
    "name" : "data",
    "namespace" : "my.example",
    "fields" : [
        {"name" : "domain", "type" : ["null", "string"], "default" : null},
        {"name" : "ip", "type" : ["null", "string"], "default" : null},
        {"name" : "port", "type" : ["null", "int"], "default" : null},
        {"name" : "score", "type" : ["null", "int"], "default" : null}
    ]
}
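
If non-null values are still wanted in the written data, one workaround is to
fill them in before calling writer.append(). Below is a minimal sketch in plain
Python, assuming the record schema is available as JSON text. fill_defaults is
a hypothetical helper written for this illustration, not part of the avro
package.

```python
import json


# Hypothetical helper, not part of the avro package: copy a datum and add
# every field that is missing, using the "default" declared in the schema.
def fill_defaults(schema_json, datum):
    schema = json.loads(schema_json)
    filled = dict(datum)
    for field in schema["fields"]:
        name = field["name"]
        if name in filled:
            continue
        if "default" not in field:
            raise ValueError("no value and no default for field %r" % name)
        filled[name] = field["default"]
    return filled


SCHEMA_JSON = """
{
    "type" : "record",
    "name" : "data",
    "namespace" : "my.example",
    "fields" : [
        {"name" : "domain", "type" : ["null", "string"], "default" : null},
        {"name" : "ip", "type" : ["null", "string"], "default" : null},
        {"name" : "port", "type" : ["null", "int"], "default" : null},
        {"name" : "score", "type" : ["null", "int"], "default" : null}
    ]
}
"""

# "domain" and "score" are absent from the input and get the schema default.
record = fill_defaults(SCHEMA_JSON, {"ip": "1.2.3.4", "port": 80})
print(record)
```

The filled dict can then be passed to writer.append(). The same helper should
also work with non-null defaults such as "EMPTY" or 0, since it only copies
whatever the schema declares.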

The following JIRAs seem to be related:

https://issues.apache.org/jira/browse/AVRO-1265
https://issues.apache.org/jira/browse/AVRO-1566

I am pretty sure that the Avro Java library supports non-null default
values for record fields. You can try it in a Java program.


*Yibing Shi*
*Customer Operations Engineer*
<http://www.cloudera.com>

On Fri, Jul 8, 2016 at 3:00 PM, Stanislav Savulchik <s.savulchik@gmail.com>
wrote:

> I'm not familiar enough with Avro to propose an "Avro solution" for
> your problem :(
>
> If you want to serialize default values into Avro for some fields you
> should provide the default values in code explicitly when writing to Avro.
> Another approach is to declare the fields as nullable using union types
> (e.g. [null, int]) and use default values in code explicitly when reading
> from Avro.
>
> I believe the "default" key you used in Avro schema is meant for schema
> evolution http://avro.apache.org/docs/current/spec.html#Schema+Resolution
>
>    - if the reader's record schema has a field that contains a default
>    value, and writer's schema does not have a field with the same name, then
>    the reader should use the default value from its field.
>
>
> On Fri, 8 Jul 2016 at 9:52, Sarvagya Pant <sarvagya.pant@gmail.com> wrote:
>
>> Hi Stanislav,
>>
>> Thanks for the reply. What I want to achieve is this: data arriving at the
>> Avro writer may not contain all the fields specified in the example above. I
>> would like to save the default value if possible, or retrieve the default
>> value when using DataFileReader. Is this possible? Should the data always
>> contain all the keys specified in the schema? I tried using ["int", "null"]
>> with "default" : 0. With that, the data was saved even when a field was
>> missing, but DataFileReader returned None instead of the default value 0.
>> Any help will be much appreciated. Thanks.
>>
>> On Thu, Jul 7, 2016 at 10:39 PM, Stanislav Savulchik <
>> s.savulchik@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I believe default values only work for readers, not writers.
>>>
>>> Spec says that (http://avro.apache.org/docs/current/spec.html):
>>> > default: A default value for this field, used when reading instances
>>> that lack this field (optional).
>>>
>>> On 7 July 2016, at 21:16, Sarvagya Pant <sarvagya.pant@gmail.com>
>>> wrote:
>>>
>>> I am trying to use Avro to replace some code that writes data as CSV,
>>> because CSV cannot store the type of a field and all data are treated as
>>> strings when consumed. I have copied the Avro example code from its
>>> website and would like to set a default value when a field is missing.
>>>
>>> My avro file looks like this:
>>>
>>> {
>>>     "type" : "record",
>>>     "name" : "data",
>>>     "namespace" : "my.example",
>>>     "fields" : [
>>>         {"name" : "domain", "type" : "string", "default" : "EMPTY"},
>>>         {"name" : "ip", "type" : "string", "default" : "EMPTY"},
>>>         {"name" : "port", "type" : "int", "default" : 0},
>>>         {"name" : "score", "type" : "int", "default" : 0}
>>>     ]
>>> }
>>>
>>> I have written a simple python file that is expected to work. It is
>>> given below:
>>>
>>> import avro.schema
>>> from avro.datafile import DataFileReader, DataFileWriter
>>> from avro.io import DatumReader, DatumWriter
>>>
>>> schema = avro.schema.parse(open("data.avsc", "rb").read())
>>>
>>> writer = DataFileWriter(open("users.avro", "wb"), DatumWriter(), schema)
>>> writer.append({"domain": "hello domain", "score" : 20, "port" : 8080})
>>> writer.append({"ip": "1.2.3.4", "port" : 80})
>>> writer.append({"domain": "another domain", "score" : 100})
>>> writer.close()
>>>
>>> reader = DataFileReader(open("users.avro", "rb"), DatumReader())
>>> for data in reader:
>>>     print data
>>> reader.close()
>>>
>>> However, when I run this program, I get an error saying that the data
>>> does not match the schema.
>>>
>>> Traceback (most recent call last):
>>>   File "D:\arko.py", line 8, in <module>
>>>     writer.append({"domain": "hello domain", "score" : 20, "port" : 8080})
>>>   File "build\bdist.win32\egg\avro\datafile.py", line 196, in append
>>>   File "build\bdist.win32\egg\avro\io.py", line 769, in write
>>>
>>> avro.io.AvroTypeException: The datum {'domain': 'hello domain', 'score':
>>> 20, 'port': 8080} is not an example of the schema {
>>>   "namespace": "my.example",
>>>   "type": "record",
>>>   "name": "userInfo",
>>>   "fields": [
>>>     {
>>>       "default": "EMPTY",
>>>       "type": "string",
>>>       "name": "domain"
>>>     },
>>>     {
>>>       "default": "EMPTY",
>>>       "type": "string",
>>>       "name": "ip"
>>>     },
>>>     {
>>>       "default": 0,
>>>       "type": "int",
>>>       "name": "port"
>>>     },
>>>     {
>>>       "default": 0,
>>>       "type": "int",
>>>       "name": "score"
>>>     }
>>>   ]
>>> }
>>> [Finished in 0.1s with exit code 1]
>>>
>>> I am using avro v1.8.0 and python 2.7. What am I doing wrong here?
>>> Thanks.
>>>
>>> --
>>>
>>> *Sarvagya Pant*
>>> *Kathmandu, Nepal*
>>>
>>>
>>>
>>
>>
>> --
>>
>> *Sarvagya Pant*
>> *Kathmandu, Nepal*
>>
>
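
The schema-resolution rule Stanislav quotes above from the spec (the reader
uses its own default when the writer's schema lacks a field) can be sketched in
plain Python. This is only a simplified illustration, not the avro library's
code: resolve_record is a hypothetical name, and a field missing from the
written dict stands in for a field missing from the writer's schema.

```python
# Simplified illustration of the reader-side rule; not the avro library's code.
def resolve_record(written, reader_fields):
    """Build the record the reader sees.

    written: dict decoded with the writer's schema.
    reader_fields: the "fields" list of the reader's record schema.
    """
    resolved = {}
    for field in reader_fields:
        name = field["name"]
        if name in written:
            # Both schemas have the field: keep the written value.
            resolved[name] = written[name]
        elif "default" in field:
            # Only the reader has the field: use the reader's default.
            resolved[name] = field["default"]
        else:
            raise ValueError(
                "writer has no field %r and reader has no default" % name
            )
    return resolved


# Data written with an older schema that had no "score" field.
old_record = {"domain": "hello domain", "ip": "1.2.3.4", "port": 8080}

# Reader's fields, taken from the schema in the original question.
reader_fields = [
    {"name": "domain", "type": "string", "default": "EMPTY"},
    {"name": "ip", "type": "string", "default": "EMPTY"},
    {"name": "port", "type": "int", "default": 0},
    {"name": "score", "type": "int", "default": 0},
]

new_record = resolve_record(old_record, reader_fields)
print(new_record)
```

Here "score" comes from the reader's default because the old data never had
it. A value that the writer did store, including an explicit null, is kept
as written.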
