avro-user mailing list archives

From Markus Strickler <mar...@braindump.ms>
Subject Re: Converting arbitrary JSON to avro
Date Wed, 19 Sep 2012 15:44:37 GMT
Hi Doug,

thanks for the suggestion, I wasn't aware that one could specify anything other than a record
as the top-level element of a schema.
I tried this, and it works well for flat data, but with nested structures it seems you still
have to go through the "value" indirection.
Also, I think this change might break existing code that relies on the current structure.

As far as size and performance go, I'll probably have to run some tests on real data once I've
come up with an appropriate schema that actually matches it.

Again, thanks a lot for your help.


On 18.09.2012 at 21:38, Doug Cutting wrote:

> On Tue, Sep 18, 2012 at 11:34 AM, Markus Strickler <markus@braindump.ms> wrote:
>> Json.Writer is indeed what I had in mind and I have successfully managed to convert
>> my existing JSON to avro using it.
>> However using GenericDatumReader on this feels pretty unnatural, as I seem to be
>> unable to access fields directly. It seems I have to access the "value" field on each record
>> which returns a Map which uses Utf8 Objects as keys for the actual fields. Or am I doing
>> something wrong here?
> Hmm.  We could re-factor Json.SCHEMA so the union is the top-level
> element.  That would get rid of the wrapper around every value.  It's
> a more redundant way to write the schema, but the binary encoding is
> identical (since a record wrapper adds no bytes).  It would hence
> require no changes to Json.Reader or Json.Writer.
> [ "long",
>  "double",
>  "string",
>  "boolean",
>  "null",
>  {"type" : "array",
>   "items" : {
>       "type" : "record",
>       "name" : "org.apache.avro.data.Json",
>       "fields" : [ {
>           "name" : "value",
>           "type" : [ "long", "double", "string", "boolean", "null",
>                      {"type" : "array", "items" : "Json"},
>                      {"type" : "map", "values" : "Json"}
>                    ]
>       } ]
>   }
>  },
>  {"type" : "map", "values" : "Json"}
> ]
> You can try this by placing this schema in
> share/schemas/org/apache/avro/data/Json.avsc and re-building the avro
> jar.
> Would such a change be useful to you?  If so, please file an issue in Jira.
> Or we could even refactor this schema so that a Json object is the
> top-level structure:
> {"type" : "map",
> "values" : [ "long",
>              "double",
>              "string",
>              "boolean",
>              "null",
>              {"type" : "array",
>               "items" : {
>                   "type" : "record",
>                   "name" : "org.apache.avro.data.Json",
>                   "fields" : [ {
>                       "name" : "value",
>                       "type" : [ "long", "double", "string", "boolean", "null",
>                                  {"type" : "array", "items" : "Json"},
>                                  {"type" : "map", "values" : "Json"}
>                                ]
>                   } ]
>               }
>              },
>              {"type" : "map", "values" : "Json"}
>            ]
> }
> This would change the binary format but would not change the
> representation that GenericDatumReader would hand you from my first
> example above (since the generic representation unwraps unions).
> Using this schema would require changes to Json.Writer and
> Json.Reader.  It would better conform to the definition of Json, which
> only permits objects as the top-level type.
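[Editorial note: the binary-format change Doug describes can also be seen by hand-encoding; this plain-Python sketch redefines the spec's encoding rules itself rather than using the Avro library, and the helper names are my own. Under the current schema a JSON object is the "map" branch of the top-level union, so it carries a leading branch index; with a map as the top-level type, that index disappears.]

```python
def zigzag(n):
    return (n << 1) ^ (n >> 63)

def varint(u):
    out = bytearray()
    while u > 0x7F:
        out.append((u & 0x7F) | 0x80)
        u >>= 7
    out.append(u)
    return bytes(out)

def enc_long(n):
    return varint(zigzag(n))

def enc_str(s):
    b = s.encode("utf-8")
    return enc_long(len(b)) + b

def enc_map(pairs):
    # One block of len(pairs) entries, then a zero count to end the map.
    out = enc_long(len(pairs))
    for k, v in pairs:
        out += enc_str(k) + v
    return out + enc_long(0)

# The JSON value 5 inside the map: inner union branch 0 ("long") + the long.
value = enc_long(0) + enc_long(5)

# Current schema: {"a": 5} is the "map" branch (index 6) of the top-level union.
union_form = enc_long(6) + enc_map([("a", value)])

# Proposed schema: the map *is* the top level, so the branch index disappears.
map_form = enc_map([("a", value)])

print(union_form.hex(), map_form.hex())
```

The two encodings differ only in the leading branch-index byte, which is exactly the incompatibility that would force changes to Json.Writer and Json.Reader.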
>> Concerning the more specific schema, you are of course completely right. Unfortunately
>> more or less all the fields in the JSON data format are optional and many have substructures,
>> so, at least in my understanding, I have to use unions of null and the actual type throughout
>> the schema. I tried using JsonDecoder first (or rather the fromjson option of the avro tool,
>> which, I think, uses JsonDecoder) but given the current JSON structures, this didn't work.
>> So I'll probably have to look into implementing my own converter.  However given
>> the rather complex structure of the original JSON I'm wondering if trying to represent the
>> data in avro is such a good idea in the first place.
> It would be interesting to see whether, with the appropriate schema,
> the dataset is smaller and faster to process as Avro than as
> Json.  If you have 1000 fields in your data but the typical record
> only has one or two non-null, then an Avro record is perhaps not a
> good representation.  An Avro map might be better, but if the values
> are similarly variable then Json might be competitive.
> Cheers,
> Doug
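
[Editorial note: Doug's last point can be made concrete with a rough byte count. The field counts and sizes below are illustrative assumptions, not measurements from any real dataset.]

```python
# Back-of-the-envelope size comparison for a sparse record with 1000
# optional fields of which only 2 are set (all numbers are assumptions).
n_fields, n_present = 1000, 2
key_len, payload_len = 10, 5   # assumed average key and value sizes in bytes

# Record of ["null", T] union fields: every absent field still costs one
# byte for the union branch index; present fields cost index + payload.
record_bytes = (n_fields - n_present) * 1 + n_present * (1 + payload_len)

# Map: one byte for the block count, then per present entry a key string
# (length prefix + bytes), a union branch index, and the payload, plus
# one byte for the terminating zero count.
map_bytes = 1 + n_present * (1 + key_len + 1 + payload_len) + 1

print(record_bytes, map_bytes)  # 1010 36
```

Under these assumptions the record spends roughly a kilobyte just saying "null", while the map pays only for the entries actually present, which is the trade-off Doug is pointing at.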
