Reply-To: user@avro.apache.org
Date: Mon, 5 Nov 2012 10:46:27 -0800
Subject: Re: Schema validation of a field's default values
From: Doug Cutting
To: user@avro.apache.org
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=f46d0401689b259fa604cdc3e71e

--f46d0401689b259fa604cdc3e71e
Content-Type: text/plain; charset=UTF-8

Mark,

I'd welcome improvements to default value validation in Avro. For
performance, I think this should be an explicit, separate operation from
parsing schemas. But we might invoke it on schemas at various points,
e.g., when creating a file.
If you are able, please contribute your implementation by filing an issue in
Avro's Jira.

Thanks,

Doug

On Sat, Nov 3, 2012 at 9:48 AM, Mark Hayes wrote:

> On Mon, Oct 29, 2012 at 12:32 PM, Doug Cutting wrote:
>
>> No, I don't know of a default value validator that's been implemented
>> yet. It would be great to have one.
>>
>> I think this would recursively walk a schema. Whenever a non-null
>> default value is found it could call ResolvingGrammarDecoder#encode().
>> That's what interprets JSON default values. (Perhaps this logic
>> should be moved, though.)
>
>
> Thanks for the reply Doug.
>
> I did find ResolvingGrammarDecoder.encode (I saw that it is called by the
> builders) and was using it as you described, but I ran into limitations:
>
> + When the field type is an array, map or record, values of the
> wrong JSON type (not array or object) are translated to an empty array,
> map or record. For example, specifying a default of 0, null or "" results
> in an empty array, map or record.
>
> + For all numeric Avro types (int, long, float and double) the default
> value may be of any JSON numeric type, and the JSON values will be coerced
> to the Avro type in spite of the fact that part of the value may be
> lost/truncated. For example, a long default value that exceeds 32 bits
> will be truncated if the field is type int.
>
> + The byte array length is not validated for a fixed type.
>
> + For nested fields and certain types (e.g., enums) a cryptic error
> is often output that does not contain the name of the offending field.
>
> These deficiencies can mask errors made by the user when defining
> a default value. This is important to our application.
>
> To compensate for these deficiencies we implemented our own checking that
> is more strict than Avro's. To do this, we serialize the default value
> using our own JSON serializer in a special mode where default values are
> applied. Any errors during serialization indicate that the default value
> is invalid.
>
> Something similar might be done in Avro itself, for example, if the JSON
> encoder were made to operate in a special mode where default values are
> applied.
>
> --mark
>
--f46d0401689b259fa604cdc3e71e--
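
The stricter checks discussed in this thread (rejecting the wrong JSON type for arrays, maps and records, rejecting out-of-range integers, checking the length of fixed values, and naming the offending field) can be illustrated with a minimal sketch. It is not Avro's API and not Mark's implementation: it works on the schema as plain parsed JSON, and the names `validate_defaults`, `check_value` and `DefaultError` are hypothetical. It assumes a union default must match the first branch of the union, and it does not resolve references to previously defined named types.

```python
import json

# Inclusive value ranges for Avro's two integer types.
INT_RANGES = {"int": (-2**31, 2**31 - 1), "long": (-2**63, 2**63 - 1)}


class DefaultError(ValueError):
    """Error whose message starts with the path of the offending field."""


def check_value(schema, value, path):
    """Strictly check one JSON default `value` against `schema`."""
    if isinstance(schema, list):
        # Assumption: a union default must match the first branch.
        return check_value(schema[0], value, path)
    kind = schema if isinstance(schema, str) else schema["type"]
    is_bool = isinstance(value, bool)

    def fail(msg):
        raise DefaultError(f"{path}: {msg} (got {json.dumps(value)})")

    if kind == "null":
        if value is not None:
            fail("expected null")
    elif kind == "boolean":
        if not is_bool:
            fail("expected a boolean")
    elif kind in INT_RANGES:
        low, high = INT_RANGES[kind]
        if is_bool or not isinstance(value, int):
            fail(f"expected an integer for {kind}")
        if not low <= value <= high:
            fail(f"value out of range for {kind}")
    elif kind in ("float", "double"):
        if is_bool or not isinstance(value, (int, float)):
            fail(f"expected a number for {kind}")
    elif kind in ("string", "bytes"):
        if not isinstance(value, str):
            fail(f"expected a JSON string for {kind}")
    elif kind == "fixed":
        if not isinstance(value, str) or len(value) != schema["size"]:
            fail(f"expected a string of length {schema['size']}")
    elif kind == "enum":
        if value not in schema["symbols"]:
            fail("not a declared enum symbol")
    elif kind == "array":
        if not isinstance(value, list):
            fail("expected a JSON array")
        for i, item in enumerate(value):
            check_value(schema["items"], item, f"{path}[{i}]")
    elif kind == "map":
        if not isinstance(value, dict):
            fail("expected a JSON object for map")
        for key, item in value.items():
            check_value(schema["values"], item, f"{path}[{key!r}]")
    elif kind == "record":
        if not isinstance(value, dict):
            fail("expected a JSON object for record")
        for field in schema["fields"]:
            name = field["name"]
            if name in value:
                check_value(field["type"], value[name], f"{path}.{name}")
            elif "default" not in field:
                fail(f"missing value for field {name!r}")
    else:
        fail(f"unsupported type {kind!r}")


def validate_defaults(schema, path=""):
    """Recursively walk `schema` and return a list of error messages."""
    errors = []
    if isinstance(schema, list):
        for branch in schema:
            errors += validate_defaults(branch, path)
    elif isinstance(schema, dict):
        kind = schema["type"]
        if kind == "record":
            base = path or schema.get("name", "record")
            for field in schema["fields"]:
                field_path = f"{base}.{field['name']}"
                if "default" in field:
                    try:
                        check_value(field["type"], field["default"], field_path)
                    except DefaultError as exc:
                        errors.append(str(exc))
                errors += validate_defaults(field["type"], field_path)
        elif kind == "array":
            errors += validate_defaults(schema["items"], path + "[]")
        elif kind == "map":
            errors += validate_defaults(schema["values"], path + "{}")
    return errors


SCHEMA = json.loads("""
{"type": "record", "name": "Event", "fields": [
  {"name": "count", "type": "int", "default": 4294967296},
  {"name": "tags", "type": {"type": "array", "items": "string"}, "default": 0},
  {"name": "id", "type": {"type": "fixed", "name": "Id", "size": 4},
   "default": "abc"},
  {"name": "inner", "type": {"type": "record", "name": "Inner", "fields": [
    {"name": "color", "default": "GREEN",
     "type": {"type": "enum", "name": "Color", "symbols": ["RED", "BLUE"]}}
  ]}},
  {"name": "note", "type": ["null", "string"], "default": null}
]}
""")

errors = validate_defaults(SCHEMA)
for message in errors:
    print(message)
```

Keeping this walk separate from schema parsing follows the suggestion above that validation be an explicit operation, so callers pay for it only where they choose to, for example when creating a file.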