From: Russell Jurney
Date: Tue, 18 Sep 2012 16:18:30 -0700
Subject: Re: Converting arbitrary JSON to avro
To: "user@avro.apache.org"

Fwiw, I do this in web apps all the time via the python avro lib and json.dumps

Russell Jurney
twitter.com/rjurney
russell.jurney@gmail.com
datasyndrome.com

On Sep 18, 2012, at 12:38 PM, Doug Cutting wrote:

> On Tue, Sep 18, 2012 at 11:34 AM, Markus Strickler wrote:
>> Json.Writer is indeed what I had in mind and I have successfully managed to convert my existing JSON to avro using it.
>> However using GenericDatumReader on this feels pretty unnatural, as I seem to be unable to access fields directly. It seems I have to access the "value" field on each record which returns a Map which uses Utf8 Objects as keys for the actual fields. Or am I doing something wrong here?
>
> Hmm. We could re-factor Json.SCHEMA so the union is the top-level
> element. That would get rid of the wrapper around every value. It's
> a more redundant way to write the schema, but the binary encoding is
> identical (since a record wrapper adds no bytes). It would hence
> require no changes to Json.Reader or Json.Writer.
>
> [ "long",
>   "double",
>   "string",
>   "boolean",
>   "null",
>   {"type" : "array",
>    "items" : {
>        "type" : "record",
>        "name" : "org.apache.avro.data.Json",
>        "fields" : [ {
>            "name" : "value",
>            "type" : [ "long", "double", "string", "boolean", "null",
>                       {"type" : "array", "items" : "Json"},
>                       {"type" : "map", "values" : "Json"}
>                     ]
>        } ]
>    }
>   },
>   {"type" : "map", "values" : "Json"}
> ]
>
> You can try this by placing this schema in
> share/schemas/org/apache/avro/data/Json.avsc and re-building the avro
> jar.
>
> Would such a change be useful to you? If so, please file an issue in Jira.
>
> Or we could even refactor this schema so that a Json object is the
> top-level structure:
>
> {"type" : "map",
>  "values" : [ "long",
>               "double",
>               "string",
>               "boolean",
>               "null",
>               {"type" : "array",
>                "items" : {
>                    "type" : "record",
>                    "name" : "org.apache.avro.data.Json",
>                    "fields" : [ {
>                        "name" : "value",
>                        "type" : [ "long", "double", "string", "boolean", "null",
>                                   {"type" : "array", "items" : "Json"},
>                                   {"type" : "map", "values" : "Json"}
>                                 ]
>                    } ]
>                }
>               },
>               {"type" : "map", "values" : "Json"}
>             ]
> }
>
> This would change the binary format but would not change the
> representation that GenericDatumReader would hand you from my first
> example above (since the generic representation unwraps unions).
> Using this schema would require changes to Json.Writer and
> Json.Reader. It would better conform to the definition of Json, which
> only permits objects as the top-level type.
>
>> Concerning the more specific schema, you are of course completely right. Unfortunately more or less all the fields in the JSON data format are optional and many have substructures, so, at least in my understanding, I have to use unions of null and the actual type throughout the schema. I tried using JsonDecoder first (or rather the fromjson option of the avro tool, which, I think, uses JsonDecoder) but given the current JSON structures, this didn't work.
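
One possible reason plain JSON fails here (an assumption, since the thread does not say what error occurred): Avro's JSON encoding writes a non-null union value as a single-key object naming the branch, so an optional string must appear as {"string": "abc"} rather than "abc". Below is a minimal sketch of a converter that rewrites plain JSON into that shape for a given schema. It uses only the standard library, not the avro package. The names to_avro_json, matches, branch_name and EVENT_SCHEMA are hypothetical, and the sketch ignores namespaces, named-type references, enum, fixed and bytes.

```python
def branch_name(schema):
    """Key that labels a union branch in Avro's JSON encoding."""
    if isinstance(schema, str):
        return schema
    if schema["type"] == "record":
        return schema["name"]   # Avro uses the full name; no namespaces in this sketch
    return schema["type"]       # "array" or "map"


def matches(value, schema):
    """Rough test of whether a plain JSON value fits a schema's type."""
    kind = schema if isinstance(schema, str) else schema["type"]
    if kind == "null":
        return value is None
    if kind == "boolean":
        return isinstance(value, bool)
    if kind in ("int", "long"):
        return isinstance(value, int) and not isinstance(value, bool)
    if kind in ("float", "double"):
        return isinstance(value, float)
    if kind == "string":
        return isinstance(value, str)
    if kind == "array":
        return isinstance(value, list)
    return isinstance(value, dict)  # record or map


def to_avro_json(value, schema):
    """Rewrite plain JSON data into Avro's JSON encoding for the given schema."""
    if isinstance(schema, list):  # union: first matching branch wins
        for branch in schema:
            if matches(value, branch):
                if branch == "null":
                    return None
                return {branch_name(branch): to_avro_json(value, branch)}
        raise ValueError(f"{value!r} fits no branch of {schema!r}")
    kind = schema if isinstance(schema, str) else schema["type"]
    if kind == "record":
        # A key missing from the input is treated as null, so the field must be nullable.
        return {f["name"]: to_avro_json(value.get(f["name"]), f["type"])
                for f in schema["fields"]}
    if kind == "array":
        return [to_avro_json(v, schema["items"]) for v in value]
    if kind == "map":
        return {k: to_avro_json(v, schema["values"]) for k, v in value.items()}
    if not matches(value, schema):
        raise ValueError(f"{value!r} is not a {kind}")
    return value


# Hypothetical schema with optional fields, for illustration only.
EVENT_SCHEMA = {"type": "record", "name": "Event", "fields": [
    {"name": "id", "type": "long"},
    {"name": "user", "type": ["null", {"type": "record", "name": "User", "fields": [
        {"name": "email", "type": ["null", "string"]}]}]},
    {"name": "tags", "type": ["null", {"type": "array", "items": "string"}]},
]}

plain = {"id": 7, "user": {"email": "a@example.org"}}
converted = to_avro_json(plain, EVENT_SCHEMA)
```

Here the optional "user" record comes out wrapped as {"User": {...}}, its email as {"string": ...}, and the missing "tags" key as an explicit null.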
>
>> So I'll probably have to look into implementing my own converter. However given the rather complex structure of the original JSON I'm wondering if trying to represent the data in avro is such a good idea in the first place.
>
> It would be interesting to see whether, with the appropriate schema,
> the dataset is smaller and faster to process as Avro than as
> Json. If you have 1000 fields in your data but the typical record
> only has one or two non-null, then an Avro record is perhaps not a
> good representation. An Avro map might be better, but if the values
> are similarly variable then Json might be competitive.
>
> Cheers,
>
> Doug
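
The point above that a record wrapper adds no bytes can be checked with a small stand-alone encoder. This is a sketch under stated assumptions, not the Avro library: it implements only the binary rules for the types Json.SCHEMA uses (long as zigzag varint, double as 8 little-endian bytes, length-prefixed strings, union as branch index plus value, array and map as counted blocks ended by a zero count, record as its fields back to back). The names encode, wrap, JSON_RECORD and JSON_UNION are hypothetical, and the recursive schema is built by Python reference rather than by the name "Json".

```python
import struct


def encode_long(n):
    """Avro long: zigzag, then base-128 varint with the low-order group first."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


# Record with a single "value" field whose type is the union (the wrapped form).
JSON_RECORD = {"type": "record", "name": "Json",
               "fields": [{"name": "value", "type": None}]}
# The same union used directly as the top-level schema (the unwrapped form).
JSON_UNION = ["long", "double", "string", "boolean", "null",
              {"type": "array", "items": JSON_RECORD},
              {"type": "map", "values": JSON_RECORD}]
JSON_RECORD["fields"][0]["type"] = JSON_UNION

PYTHON_TYPE = {"long": int, "double": float, "string": str, "boolean": bool,
               "null": type(None), "array": list, "map": dict}


def branch_index(datum, union):
    """Pick the union branch whose Python type matches the datum exactly."""
    for i, branch in enumerate(union):
        kind = branch if isinstance(branch, str) else branch["type"]
        if type(datum) is PYTHON_TYPE[kind]:
            return i
    raise TypeError(f"no union branch for {datum!r}")


def encode(datum, schema):
    if isinstance(schema, list):  # union: branch index, then the value
        i = branch_index(datum, schema)
        return encode_long(i) + encode(datum, schema[i])
    kind = schema if isinstance(schema, str) else schema["type"]
    if kind == "null":
        return b""
    if kind == "boolean":
        return b"\x01" if datum else b"\x00"
    if kind == "long":
        return encode_long(datum)
    if kind == "double":
        return struct.pack("<d", datum)
    if kind == "string":
        raw = datum.encode("utf-8")
        return encode_long(len(raw)) + raw
    if kind == "record":  # fields back to back, no header or length
        return b"".join(encode(datum[f["name"]], f["type"]) for f in schema["fields"])
    if kind == "array":
        items = b"".join(encode(item, schema["items"]) for item in datum)
        return (encode_long(len(datum)) + items if datum else b"") + encode_long(0)
    if kind == "map":
        entries = b"".join(encode(k, "string") + encode(v, schema["values"])
                           for k, v in datum.items())
        return (encode_long(len(datum)) + entries if datum else b"") + encode_long(0)
    raise ValueError(f"unsupported type {kind!r}")


def wrap(value):
    """Shape plain JSON data as nested {'value': ...} records."""
    if isinstance(value, list):
        value = [wrap(v) for v in value]
    elif isinstance(value, dict):
        value = {k: wrap(v) for k, v in value.items()}
    return {"value": value}


doc = {"name": "avro", "tags": ["a", 1, 2.5, None, True], "nested": {"k": []}}
as_record = encode(wrap(doc), JSON_RECORD)         # record at the top level
as_union = encode(wrap(doc)["value"], JSON_UNION)  # union at the top level
print(as_record == as_union)
```

Both calls produce the same byte string, because the record case contributes nothing beyond the encoding of its one field.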