avro-commits mailing list archives

From cutt...@apache.org
Subject svn commit: r1298117 - in /avro/trunk: ./ doc/src/content/htmldocs/ doc/src/content/xdocs/ lang/java/avro/src/main/java/org/apache/avro/ lang/java/avro/src/test/java/org/apache/avro/ lang/java/avro/src/test/java/org/apache/avro/util/ share/test/data/
Date Wed, 07 Mar 2012 21:01:33 GMT
Author: cutting
Date: Wed Mar  7 21:01:33 2012
New Revision: 1298117

URL: http://svn.apache.org/viewvc?rev=1298117&view=rev
Log:
AVRO-1006.  Add schema fingerprinting to specification and Java.  Contributed by Raymie Stata.

Added:
    avro/trunk/doc/src/content/htmldocs/
    avro/trunk/doc/src/content/htmldocs/canonical-completeness.html   (with props)
    avro/trunk/lang/java/avro/src/main/java/org/apache/avro/SchemaNormalization.java   (with props)
    avro/trunk/lang/java/avro/src/test/java/org/apache/avro/TestSchemaNormalization.java   (with props)
    avro/trunk/lang/java/avro/src/test/java/org/apache/avro/util/CaseFinder.java   (with props)
    avro/trunk/lang/java/avro/src/test/java/org/apache/avro/util/TestCaseFinder.java   (with props)
    avro/trunk/share/test/data/schema-tests.txt   (with props)
Modified:
    avro/trunk/CHANGES.txt
    avro/trunk/doc/src/content/xdocs/spec.xml

Modified: avro/trunk/CHANGES.txt
URL: http://svn.apache.org/viewvc/avro/trunk/CHANGES.txt?rev=1298117&r1=1298116&r2=1298117&view=diff
==============================================================================
--- avro/trunk/CHANGES.txt (original)
+++ avro/trunk/CHANGES.txt Wed Mar  7 21:01:33 2012
@@ -4,6 +4,9 @@ Avro 1.7.0 (unreleased)
 
   NEW FEATURES
 
+    AVRO-1006.  Add schema fingerprinting to specification and Java.
+    (Raymie Stata via cutting)
+
   IMPROVEMENTS
 
   BUG FIXES

Added: avro/trunk/doc/src/content/htmldocs/canonical-completeness.html
URL: http://svn.apache.org/viewvc/avro/trunk/doc/src/content/htmldocs/canonical-completeness.html?rev=1298117&view=auto
==============================================================================
--- avro/trunk/doc/src/content/htmldocs/canonical-completeness.html (added)
+++ avro/trunk/doc/src/content/htmldocs/canonical-completeness.html Wed Mar  7 21:01:33 2012
@@ -0,0 +1,204 @@
+<html>
+<!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
+<head>
+<title>Completeness of "Parsing Canonical Form"</title>
+</head>
+<body>
+
+<center><h1>Completeness of "Parsing Canonical Form"</h1></center>
+
+<h2>1.0 Introduction</h2>
+
+<p>One of the defining characteristics of Avro is that a reader is assumed to have the "same" schema used by the writer of the data the reader is reading.  This assumption leads to a data format that's compact and amenable to many forms of schema evolution.  However, there are nuances to defining exactly what it means for the reader to have "the same" schema used by the writer.  We want to allow, for example, trivial transformations, such as the insertion of whitespace.  But we can't allow transformations that change the real meaning of schemas, such as a reordering of fields in a record.</p>
+
+<p>To clearly define what it means for a reader to have "the same" schema as a writer, the Avro specification defines <dfn>Parsing Canonical Form</dfn> (PCF), a set of transformations on Avro schemas that strip away irrelevancies (e.g., "doc" attributes) and normalize the JSON text (e.g., dealing with whitespace).  Two schemas are defined to be "the same" as far as a reader is concerned if and only if their PCFs are textually equal.</p>
+
+<p>We believe that PCF is <em>sound</em> and <em>complete</em>.  Soundness means that the PCF of a schema is logically equivalent to the original form, i.e., we can use the PCF in place of the original form without introducing bugs.  Completeness is "maximal soundness:" if two schemas are logically equivalent, then their PCFs will be textually identical.  The Avro specification claims that PCF is complete when it says: "[if two schemas have the same PCF, then] there is no serialized data that would allow a reader to distinguish data generated by a writer using one of the original schemas from data generated by a writer using the other original schema."</p>
+
+<p>We believe that the transformations that define PCF are "self-evidently" sound to people familiar with Avro.  For example, fixing the order of fields in a JSON object, or eliminating irrelevant attributes like <code>doc</code>, or using the simple <code>int</code> in place of <code>{"type":"int"}</code> clearly don't change the meaning of a schema.</p>
+
+<p>Completeness, on the other hand, is much less obvious.  How do we know that there aren't two logically equivalent schemas that happen to reduce to different canonical forms?  All it takes is one such pair to foil our claim of completeness.</p>
+
+<p>In general, completeness properties like this can be tricky to prove.  It turns out that, while soundness is critical to us, completeness is not.  If two schemas are operationally equivalent (i.e., a reader can't tell their output apart), but we accidentally treat them as if they are different, then typically all that happens is that we'll do more work.  For example, we might generate a decoder object to decode some incoming data when it turns out that we had already cached a decoder object that could do the job.  This is not likely to happen often, and thus incompleteness isn't a huge problem.</p>
+
+<p>At the same time, if we knew that our canonical forms were complete, then we might take advantage of that fact in some circumstances (e.g., to serialize schemas).  Also, the <code>Schema.equals(Object)</code> method provided in the Avro implementation makes many of the same assumptions made in the PCF definition.  Thus, a completeness proof for our canonicalization would give us confidence in the correctness of this equality algorithm.  So this issue is not entirely academic.</p>
+
+<p>We haven't worked out a full, formal proof (we hope someone from the community will step up to that task!).  However, we've been thinking about it quite a bit, and we thought we'd share our thoughts so far.</p>
+
+
+<h2>2.0 Completeness argument for Parsing Canonical Form</h2>
+
+<p>Our formalization of Avro schemas would be based on interpreting them as grammars.  In this interpretation, Avro schemas are grammars that generate tagged data streams.  Consider, for example, the following schema for a linked-list:
+<pre>
+  {"type":"record", "name":"list", "fields":[
+     {"name":"value", "type":"int"},
+     {"name":"tail",  "type":["null", "list"]}
+   ]}
+</pre>
+Interpreted as a grammar, it can generate a tagged data-stream that looks like this:
+<pre>
+  [record,"list"][field,"value"][int,10][field,"tail"][union,1]
+    [record,"list"][field,"value"][int,22][field,"tail"][union,0]
+</pre>
+(this is a two-record linked list whose first cell contains the value "10" and second cell the value "22").  Avro schemas can trivially be interpreted as grammars for such tagged data streams.  Formal proofs involving Avro schemas can be carried out as proofs about languages and grammars.</p>
+
+<p>So what does it mean for the canonical form of a schema to be "complete?"  Let <i>L(S)</i> denote the language generated by the Avro schema <code>S</code>, and <i>C(S)</i> denote the canonical form of the schema.  The canonicalization is complete if:
+<blockquote>
+For all schemas <i>S<sub>1</sub></i> and <i>S<sub>2</sub></i>,<br>
+&nbsp;&nbsp;&nbsp;&nbsp;<i>L(S<sub>1</sub>) = L(S<sub>2</sub>) &rArr; C(S<sub>1</sub>) = C(S<sub>2</sub>)</i>
+</blockquote>
+That is, for any two schemas that generate the same language, their canonicalizations are textually equivalent.
+
+<p>To prove this, we need to define some functions:
+<blockquote>
+<i>J</i> is a variable name we often use to denote a JSON expression representing an Avro schema<br>
+<i>C(J)</i> is the Parsing Canonical Form of <i>J</i> as defined in the Avro specification<br>
+<i>P(J)</i> is the ASG for an Avro schema generated by parsing <i>J</i> (think of <i>P(J)</i> as a <code>Schema</code> Java object)<br>
+<i>S</i> is a variable name we often use to denote such ASGs<br>
+<i>L(S)</i> is the language generated by a schema ASG
+</blockquote>
+<p>With all these symbols defined, our completeness criterion is now rendered as:
+<blockquote>
+&forall; <i>J<sub>1</sub></i>, <i>J<sub>2</sub></i>:
+<i>L(P(J<sub>1</sub>)) = L(P(J<sub>2</sub>)) &rArr; C(J<sub>1</sub>) = C(J<sub>2</sub>)</i>
+</blockquote>
+We'll prove this by breaking it into two parts:
+<blockquote>
+(1): &forall; <i>S<sub>1</sub></i>, <i>S<sub>2</sub></i>:
+<i>L(S<sub>1</sub>) = L(S<sub>2</sub>) &rArr; S<sub>1</sub> &cong; S<sub>2</sub></i><br>
+(2): &forall; <i>J<sub>1</sub></i>, <i>J<sub>2</sub></i>:
+<i>P(J<sub>1</sub>) &cong; P(J<sub>2</sub>) &rArr; C(J<sub>1</sub>) = C(J<sub>2</sub>)</i>
+</blockquote>
+In this two-step decomposition, we've introduced a new operator &cong;, which compares the ASGs of two Avro schemas.  The ASG of an Avro schema can be viewed as a rooted, labeled, directed graph.  Because Avro schemas can be recursive, these graphs can be cyclic.  The &cong; operator is "true" between two ASGs when the sets of minimal labeled paths (no cycles, starting from the root) of the two ASGs are the same.  (The <code>Schema.equals(Object)</code> method in the Avro implementation computes something close to this &cong; relation, except that &cong; ignores "irrelevant" attributes like <code>doc</code> and <code>aliases</code>.)
+
+<p>It turns out that, implicit in the Avro Specification, there are "canonicalization" rules that are important to our proof of completeness.  In particular, the Avro Specification says that a name must be defined "before" it is used, and that a name cannot be defined more than once in a schema.  Consider the following redefinition of the linked-list schema, for example:
+<pre>
+  {"type":"record", "name":"list", "fields":[
+    {"name":"value", "type":"int"},
+    {"name":"tail",
+      "type":["null", {"type":"record", "name":"list", "fields":[
+                        {"name":"value", "type":"int"},
+                        {"name":"tail", "type":["null", "list"]}]}]}
+  ]}
+</pre>
+In this redefinition, we've "unpacked" the recursion in the linked list by one level.  In some sense, this is a perfectly fine definition of a linked list, and is operationally equivalent to the more compact version given earlier.  So it makes sense that our claim of completeness is dependent upon this kind of "unpacking" not occurring in real schemas.</p>
+
+<p>To deal with this issue in our proof, we pretend that the Avro specification does <em>not</em> require that named schemas be defined just once, and be defined "before" they are used.  Rather, we treat this requirement as an additional transformation rule in the definition of Parsing Canonical Form:
+<ul>
+  <li> [MINIMIZE] Eliminate redundant definitions of named types (records, enums, and fixeds).  That is, for each named type, keep a single defining instance at the point of first use, and then use just the name (rather than the full schema) everywhere else.</li>
+</ul>
+(As in the Avro spec, "first use" is defined as the first occurrence in a depth-first, left-to-right traversal of the schema abstract-syntax graph (ASG).)
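+
+<p>For example, applying MINIMIZE to the "unpacked" linked-list schema above replaces the nested redefinition of <code>list</code> with a simple name reference, recovering the compact schema given at the start of this section:
+<pre>
+  {"type":"record", "name":"list", "fields":[
+     {"name":"value", "type":"int"},
+     {"name":"tail",  "type":["null", "list"]}
+   ]}
+</pre>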
+
+<p>Getting back to the proof of (1) and (2) from above, we need to introduce more functions:
+<blockquote>
+<i>P(J)=P<sub>A</sub>(P<sub>J</sub>(J))</i> - decompose parser into: <br>
+&nbsp;&nbsp;<i>P<sub>J</sub></i> is the JSON parser<br>
+&nbsp;&nbsp;<i>P<sub>A</sub></i> is the Avro parser (takes JSON ASTs as input)<br>
+<i>C(J)=C<sub>J</sub>(C<sub>A</sub>(C<sub>M</sub>(J)))</i> - decompose canonicalization into:<br>
+&nbsp;&nbsp;<i>C<sub>M</sub>(J)</i> the MINIMIZE step<br>
+&nbsp;&nbsp;<i>C<sub>A</sub>(J)</i> Avro normalizations<br>
+&nbsp;&nbsp;<i>C<sub>J</sub>(J)</i> JSON normalizations<br>
+<i>M(S)</i> is the "named-schema NFA minimization" of <i>S</i><br>
+</blockquote>
+"Named-schema NFA minimization" is similar to general NFA minimization, except that we only collapse nodes and edges related to named schema entities and not other nodes.  For example, we would <em>not</em> collapse the nodes associated with <code>int</code> or <code>union</code> schemas.
+
+<p> Our proof of (1) looks like this (this proof refers to lemmas (3) and (4), which are defined later):
+<blockquote>
+<table>
+<tr><td>&forall;<i>S<sub>1</sub>,S<sub>2</sub></i>:</td><td><i>L(S<sub>1</sub>)=L(S<sub>2</sub>)</i></td><td></td></tr>
+<tr>
+<td></td><td>&rArr;<i>M(S<sub>1</sub>)=M(S<sub>2</sub>)</i></td>
+<td>by (3)</td>
+</tr>
+<tr>
+<td></td><td>&rArr;<i>S<sub>1</sub>&cong;S<sub>2</sub></i></td>
+<td>by (4)</td>
+</tr>
+</table>
+</blockquote>
+Here's the proof of (2) (this proof refers to lemmas (4)-(7), which are defined later):
+<blockquote>
+<table>
+<tr><td>&forall;<i>J<sub>1</sub>,J<sub>2</sub></i>:</td><td><i>P(J<sub>1</sub>)&cong;P(J<sub>2</sub>)</i></td><td></td></tr>
+
+<tr>
+<td></td><td>&rArr;<i>M(P(J<sub>1</sub>))=M(P(J<sub>2</sub>))</i></td>
+<td>by (4)</td>
+</tr>
+
+<tr>
+<td></td><td>&rArr;<i>P(C<sub>M</sub>(J<sub>1</sub>))=P(C<sub>M</sub>(J<sub>2</sub>))</i></td>
+<td>by (5)</td>
+</tr>
+
+<tr>
+<td></td><td>&rArr;<i>P<sub>A</sub>(P<sub>J</sub>(C<sub>M</sub>(J<sub>1</sub>)))=P<sub>A</sub>(P<sub>J</sub>(C<sub>M</sub>(J<sub>2</sub>)))</i></td>
+<td>by definition of <i>P</i></td>
+</tr>
+
+<tr>
+<td></td><td>&rArr;<i>P<sub>J</sub>(C<sub>A</sub>(C<sub>M</sub>(J<sub>1</sub>)))=P<sub>J</sub>(C<sub>A</sub>(C<sub>M</sub>(J<sub>2</sub>)))</i></td>
+<td>by (6)</td>
+</tr>
+
+<tr>
+<td></td><td>&rArr;<i>C<sub>J</sub>(C<sub>A</sub>(C<sub>M</sub>(J<sub>1</sub>)))=C<sub>J</sub>(C<sub>A</sub>(C<sub>M</sub>(J<sub>2</sub>)))</i></td>
+<td>by (7)</td>
+</tr>
+
+<tr>
+<td></td><td>&rArr;<i>C(J<sub>1</sub>)=C(J<sub>2</sub>)</i></td>
+<td>by definition of <i>C</i></td>
+</tr>
+</table>
+</blockquote>
+
+Here are the lemmas needed above:
+<blockquote>
+(3): &forall; <i>S<sub>1</sub></i>, <i>S<sub>2</sub></i>:
+<i>L(S<sub>1</sub>) = L(S<sub>2</sub>) &rArr; M(S<sub>1</sub>) = M(S<sub>2</sub>)</i><br>
+
+(4): &forall; <i>S<sub>1</sub></i>, <i>S<sub>2</sub></i>:
+<i>M(S<sub>1</sub>) = M(S<sub>2</sub>) &hArr; S<sub>1</sub> &cong; S<sub>2</sub></i> <br>
+
+(5): &forall; <i>J</i>: <i>M(P(J)) = P(C<sub>M</sub>(J))</i><br>
+
+(6): &forall; <i>J<sub>1</sub></i>, <i>J<sub>2</sub></i>:
+<i>P<sub>A</sub>(P<sub>J</sub>(J<sub>1</sub>)) = P<sub>A</sub>(P<sub>J</sub>(J<sub>2</sub>)) &rArr; P<sub>J</sub>(C<sub>A</sub>(J<sub>1</sub>)) = P<sub>J</sub>(C<sub>A</sub>(J<sub>2</sub>))</i> <br>
+
+(7): &forall; <i>J<sub>1</sub></i>, <i>J<sub>2</sub></i>:
+<i>P<sub>J</sub>(J<sub>1</sub>) = P<sub>J</sub>(J<sub>2</sub>) &rArr; C<sub>J</sub>(J<sub>1</sub>) = C<sub>J</sub>(J<sub>2</sub>)</i> <br>
+</blockquote>
+
+<p>Proving the lemmas:
+<ol start="3">
+<li> This says that the language-related part of our canonicalization is complete, i.e., <i>M</i> finds the equivalence-classes of <i>L</i>.  I would imagine one could prove this by modifying a proof that equivalence of LL(1) grammars is decidable.  I haven't gotten very far in showing this, however.
+<li> The right-hand direction of this follows from the definition of minimization.  The left-hand direction seems correct, but I'm not sure how to prove it (I think it also follows from the definition of minimization).
+<li> This is showing that the MINIMIZE step (which is done on JSON expressions) is equivalent to doing a named-schema NFA minimization on the ASG representation.  This should follow pretty directly from a detailed definition of <i>M</i>, if we provided one.
+<li> This says that the Avro-related part of our canonicalization is complete, i.e., that <i>C<sub>A</sub></i> finds equivalence-classes of <i>P<sub>A</sub></i>.
+<li> This says that the JSON-related part of our canonicalization is complete, i.e., that <i>C<sub>J</sub></i> finds equivalence-classes of <i>P<sub>J</sub></i>.  Note that, implicitly, this lemma ranges over only JSON expressions that are legal Avro schemas with no doc strings or default values, and thus (for example) doesn't need to worry about normalization of floating-point literals.
+</ol>
+
+
+<h2>3.0 Concluding remarks</h2>
+
+Engineers <a href="http://www.aps.org/publications/apsnews/201002/physicshistory.cfm">have a history</a> of running ahead of formal mathematical proofs, when things "seem correct" to them.  In this case, it seems pretty obvious that Parsing Canonical Form is complete as well as sound, and we should go ahead and treat it as such.  At the same time, formal proofs often turn up corner cases and exceptions that are valuable to document and account for.  Thus, it'd be nice if someone could provide a better completeness argument than we've been able to provide so far.
+
+</body>
+</html>

Propchange: avro/trunk/doc/src/content/htmldocs/canonical-completeness.html
------------------------------------------------------------------------------
    svn:eol-style = native

Modified: avro/trunk/doc/src/content/xdocs/spec.xml
URL: http://svn.apache.org/viewvc/avro/trunk/doc/src/content/xdocs/spec.xml?rev=1298117&r1=1298116&r2=1298117&view=diff
==============================================================================
--- avro/trunk/doc/src/content/xdocs/spec.xml (original)
+++ avro/trunk/doc/src/content/xdocs/spec.xml Wed Mar  7 21:01:33 2012
@@ -1135,6 +1135,198 @@
 
     </section>
 
+    <section>
+      <title>Parsing Canonical Form for Schemas</title>
+
+      <p>One of the defining characteristics of Avro is that a reader
+      is assumed to have the "same" schema used by the writer of the
+      data the reader is reading.  This assumption leads to a data
+      format that's compact and also amenable to many forms of schema
+      evolution.  However, the specification so far has not defined
+      what it means for the reader to have the "same" schema as the
+      writer.  Does the schema need to be textually identical?  Well,
+      clearly adding or removing some whitespace in a JSON expression
+      does not change its meaning.  At the same time, reordering the
+      fields of records clearly <em>does</em> change the meaning.  So
+      what does it mean for a reader to have "the same" schema as a
+      writer?</p>
+
+      <p><em>Parsing Canonical Form</em> is a transformation of a
+      writer's schema that lets us define what it means for two
+      schemas to be "the same" for the purpose of reading data written
+      against the schema.  It is called <em>Parsing</em> Canonical Form
+      because the transformations strip away parts of the schema, like
+      "doc" attributes, that are irrelevant to readers trying to parse
+      incoming data.  It is called <em>Canonical Form</em> because the
+      transformations normalize the JSON text (such as the order of
+      attributes) in a way that eliminates unimportant differences
+      between schemas.  If the Parsing Canonical Forms of two
+      different schemas are textually equal, then those schemas are
+      "the same" as far as any reader is concerned, i.e., there is no
+      serialized data that would allow a reader to distinguish data
+      generated by a writer using one of the original schemas from
+      data generated by a writer using the other original schema.
+      (We sketch a proof of this property in a companion
+      document.)</p>
+
+      <p>The next subsection specifies the transformations that define
+      Parsing Canonical Form.  But with a well-defined canonical form,
+      it can be convenient to go one step further, transforming these
+      canonical forms into simple integers ("fingerprints") that can
+      be used to uniquely identify schemas.  The subsection after next
+      recommends some standard practices for generating such
+      fingerprints.</p>
+
+      <section>
+        <title>Transforming into Parsing Canonical Form</title>
+
+        <p>Assuming an input schema (in JSON form) that's already
+        UTF-8 text for a <em>valid</em> Avro schema (including all
+        quotes as required by JSON), the following transformations
+        will produce its Parsing Canonical Form:</p>
+        <ul>
+          <li> [PRIMITIVES] Convert primitive schemas to their simple
+          form (e.g., <code>int</code> instead of
+          <code>{"type":"int"}</code>).</li>
+
+          <li> [FULLNAMES] Replace short names with fullnames, using
+          applicable namespaces to do so.  Then eliminate
+          <code>namespace</code> attributes, which are now redundant.</li>
+
+          <li> [STRIP] Keep only attributes that are relevant to
+          parsing data, which are: <code>type</code>,
+          <code>name</code>, <code>fields</code>,
+          <code>symbols</code>, <code>items</code>,
+          <code>values</code>, <code>size</code>.  Strip all others
+          (e.g., <code>doc</code> and <code>aliases</code>).</li>
+
+          <li> [ORDER] Order the appearance of fields of JSON objects
+          as follows: <code>name</code>, <code>type</code>,
+          <code>fields</code>, <code>symbols</code>,
+          <code>items</code>, <code>values</code>, <code>size</code>.
+          For example, if an object has <code>type</code>,
+          <code>name</code>, and <code>size</code> fields, then the
+          <code>name</code> field should appear first, followed by the
+          <code>type</code> and then the <code>size</code> fields.</li>
+
+          <li> [STRINGS] For all JSON string literals in the schema
+          text, replace any escaped characters (e.g., \uXXXX escapes)
+          with their UTF-8 equivalents.</li>
+
+          <li> [INTEGERS] Eliminate quotes around and any leading
+          zeros in front of JSON integer literals (which appear in the
+          <code>size</code> attributes of <code>fixed</code> schemas).</li>
+
+          <li> [WHITESPACE] Eliminate all whitespace in JSON outside of string literals.</li>
+        </ul>
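+
+        <p>For example (an illustrative schema derived by applying
+        the rules above; it is not an additional rule), the following
+        schema:</p>
+        <source>
+{ "type" : "record",
+  "name" : "foo",
+  "namespace" : "x.y",
+  "doc" : "example",
+  "fields" : [ {"name" : "f1", "type" : {"type" : "boolean"}} ]
+}
+        </source>
+        <p>has this Parsing Canonical Form:</p>
+        <source>
+{"name":"x.y.foo","type":"record","fields":[{"name":"f1","type":"boolean"}]}
+        </source>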
+      </section>
+
+      <section>
+        <title>Schema Fingerprints</title>
+
+        <p>"[A] fingerprinting algorithm is a procedure that maps an
+        arbitrarily large data item (such as a computer file) to a
+        much shorter bit string, its <em>fingerprint,</em> that
+        uniquely identifies the original data for all practical
+        purposes" (quoted from [<a
+        href="http://en.wikipedia.org/wiki/Fingerprint_(computing)">Wikipedia</a>]).
+        In the Avro context, fingerprints of Parsing Canonical Form
+        can be useful in a number of applications; for example, to
+        cache encoder and decoder objects, to tag data items with a
+        short substitute for the writer's full schema, and to quickly
+        negotiate common-case schemas between readers and writers.</p>
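+
+        <p>For instance, a decoder cache keyed by fingerprint might
+        look like the following sketch (illustrative only;
+        <code>Decoder</code> and <code>makeDecoder</code> are
+        hypothetical stand-ins for an implementation's own types, and
+        <code>fingerprint64</code> is the function defined
+        below):</p>
+        <source>
+Map&lt;Long, Decoder&gt; cache = new HashMap&lt;Long, Decoder&gt;();
+
+Decoder decoderFor(Schema writer, byte[] canonicalUtf8) {
+  long fp = fingerprint64(canonicalUtf8);  // UTF-8 bytes of the PCF
+  Decoder d = cache.get(fp);
+  if (d == null) {                // schema not seen before
+    d = makeDecoder(writer);      // hypothetical decoder factory
+    cache.put(fp, d);
+  }
+  return d;
+}
+        </source>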
+
+        <p>In designing fingerprinting algorithms, there is a
+        fundamental trade-off between the length of the fingerprint
+        and the probability of collisions.  To help application
+        designers find appropriate points within this trade-off space,
+        while encouraging interoperability and ease of implementation,
+        we recommend using one of the following three algorithms when
+        fingerprinting Avro schemas:</p>
+
+        <ul>
+          <li> When applications can tolerate longer fingerprints, we
+          recommend using the <a
+          href="http://en.wikipedia.org/wiki/SHA-2">SHA-256 digest
+          algorithm</a> to generate 256-bit fingerprints of Parsing
+          Canonical Forms.  Most languages today have SHA-256
+          implementations in their libraries.</li>
+
+          <li> At the opposite extreme, the smallest fingerprint we
+          recommend is a 64-bit <a
+          href="http://en.wikipedia.org/wiki/Rabin_fingerprint">Rabin
+          fingerprint</a>.  Below, we provide pseudo-code for this
+          algorithm that can be easily translated into any programming
+          language.  64-bit fingerprints should guarantee uniqueness
+          for schema caches of up to a million entries (for such a
+          cache, the chance of a collision is 3E-8).  We don't
+          recommend shorter fingerprints, as the chance of a collision
+          is too great (for example, with 32-bit fingerprints, a cache
+          with as few as 100,000 schemas has a 50% chance of having a
+          collision).</li>
+
+          <li>Between these two extremes, we recommend using the <a
+          href="http://en.wikipedia.org/wiki/MD5">MD5 message
+          digest</a> to generate 128-bit fingerprints.  These make
+          sense only where very large numbers of schemas are being
+          manipulated (tens of millions); otherwise, 64-bit
+          fingerprints should be sufficient.  As with SHA-256, MD5
+          implementations are found in most libraries today.</li>
+        </ul>
+
+        <p> These fingerprints are <em>not</em> meant to provide any
+        security guarantees, even the longer SHA-256-based ones.  Most
+        Avro applications should be surrounded by security measures
+        that prevent attackers from writing random data and otherwise
+        interfering with the consumers of schemas.  We recommend that
+        these surrounding mechanisms be used to prevent collision and
+        pre-image attacks (i.e., "forgery") on schema fingerprints,
+        rather than relying on the security properties of the
+        fingerprints themselves.</p>
+
+        <p>Rabin fingerprints are <a
+        href="http://en.wikipedia.org/wiki/Cyclic_redundancy_check">cyclic
+        redundancy checks</a> computed using irreducible polynomials.
+        In the style of the Appendix of <a
+        href="http://www.ietf.org/rfc/rfc1952.txt">RFC&nbsp;1952</a>
+        (pg 10), which defines the CRC-32 algorithm, here's our
+        definition of the 64-bit AVRO fingerprinting algorithm:</p>
+
+        <source>
+long fingerprint64(byte[] buf) {
+  if (FP_TABLE == null) initFPTable();
+  long fp = EMPTY;
+  for (int i = 0; i &lt; buf.length; i++)
+    fp = (fp &gt;&gt;&gt; 8) ^ FP_TABLE[(int)(fp ^ buf[i]) &amp; 0xff];
+  return fp;
+}
+
+static long EMPTY = 0xc15d213aa4d7a795L;
+static long[] FP_TABLE = null;
+
+void initFPTable() {
+  FP_TABLE = new long[256];
+  for (int i = 0; i &lt; 256; i++) {
+    long fp = i;
+    for (int j = 0; j &lt; 8; j++)
+      fp = (fp &gt;&gt;&gt; 1) ^ (EMPTY &amp; -(fp &amp; 1L));
+    FP_TABLE[i] = fp;
+  }
+}
+        </source>
+
+        <p> Readers interested in the mathematics behind this
+        algorithm may want to read <a
+        href="http://www.scribd.com/fb-6001967/d/84795-Crc">this book
+        chapter.</a> (Unlike RFC-1952 and the book chapter, we prepend
+        a single one bit to messages.  We do this because CRCs ignore
+        leading zero bits, which can be problematic.  Our code
+        prepends a one-bit by initializing fingerprints using
+        <code>EMPTY</code>, rather than initializing using zero as in
+        RFC-1952 and the book chapter.)</p>
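+
+        <p>As a worked example, applying this algorithm to the six
+        UTF-8 bytes of the canonical form <code>"null"</code> (quotes
+        included) yields the 64-bit value
+        <code>7195948357588979594</code>; this value appears in, and
+        is cross-checked by, the test data in
+        <code>share/test/data/schema-tests.txt</code>.</p>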
+      </section>
+    </section>
+
   <p><em>Apache Avro, Avro, Apache, and the Avro and Apache logos are
    trademarks of The Apache Software Foundation.</em></p>
 

Added: avro/trunk/lang/java/avro/src/main/java/org/apache/avro/SchemaNormalization.java
URL: http://svn.apache.org/viewvc/avro/trunk/lang/java/avro/src/main/java/org/apache/avro/SchemaNormalization.java?rev=1298117&view=auto
==============================================================================
--- avro/trunk/lang/java/avro/src/main/java/org/apache/avro/SchemaNormalization.java (added)
+++ avro/trunk/lang/java/avro/src/main/java/org/apache/avro/SchemaNormalization.java Wed Mar  7 21:01:33 2012
@@ -0,0 +1,173 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.avro;
+
+import java.util.Map;
+import java.util.HashMap;
+import java.io.IOException;
+import java.io.UnsupportedEncodingException;
+import java.security.MessageDigest;
+import java.security.NoSuchAlgorithmException;
+
+/** Collection of static methods for generating the canonical form of
+ * schemas (see {@link #toParsingForm}) -- and fingerprints of canonical
+ * forms ({@link #fingerprint}).
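+ *
+ * <p>For example (illustrative usage):
+ * <pre>
+ *   String canonicalForm = SchemaNormalization.toParsingForm(schema);
+ *   long fp64 = SchemaNormalization.parsingFingerprint64(schema);
+ * </pre>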
+ */
+public class SchemaNormalization {
+
+  private SchemaNormalization() {}
+
+  /** Returns "Parsing Canonical Form" of a schema as defined by Avro
+    * spec. */
+  public static String toParsingForm(Schema s) {
+    try {
+      Map<String,String> env = new HashMap<String,String>();
+      return build(env, s, new StringBuilder()).toString();
+    } catch (IOException e) {
+      // Shouldn't happen, b/c StringBuilder can't throw IOException
+      throw new RuntimeException(e);
+    }
+  }
+
+  /** Returns a fingerprint of a string of bytes.  This string is
+    * presumed to contain a canonical form of a schema.  The
+    * algorithm used to compute the fingerprint is selected by the
+    * argument <i>fpName</i>.  If <i>fpName</i> equals the string
+    * <code>"CRC-64-AVRO"</code>, then the result of {@link #fingerprint64} is
+    * returned in little-endian format.  Otherwise, <i>fpName</i> is
+    * used as an algorithm name for {@link
+    * MessageDigest#getInstance(String)}, which will throw
+    * <code>NoSuchAlgorithmException</code> if it doesn't recognize
+    * the name.
+    * <p> Recommended Avro practice dictates that
+    * <code>"CRC-64-AVRO"</code> is used for 64-bit fingerprints,
+    * <code>"MD5"</code> is used for 128-bit fingerprints, and
+    * <code>"SHA-256"</code> is used for 256-bit fingerprints. */
+  public static byte[] fingerprint(String fpName, byte[] data)
+    throws NoSuchAlgorithmException
+  {
+    if (fpName.equals("CRC-64-AVRO")) {
+      long fp = fingerprint64(data);
+      byte[] result = new byte[8];
+      for (int i = 0; i < 8; i++) {
+        result[i] = (byte)fp;
+        fp >>= 8;
+      }
+      return result;
+    }
+
+    MessageDigest md = MessageDigest.getInstance(fpName);
+    return md.digest(data);
+  }
+
+  /** Returns the 64-bit Rabin Fingerprint (as recommended in the Avro
+    * spec) of a byte string. */
+  public static long fingerprint64(byte[] data) {
+    if (fpTable64 == null) fillFPTable64();
+    long result = EMPTY64;
+    for (byte b: data)
+      result = (result >>> 8) ^ fpTable64[(int)(result ^ b) & 0xff];
+    return result;
+  }
+
+  /** Returns {@link #fingerprint} applied to the parsing canonical form
+    * of the supplied schema. */
+  public static byte[] parsingFingerprint(String fpName, Schema s)
+    throws NoSuchAlgorithmException
+  {
+    try {
+      return fingerprint(fpName, toParsingForm(s).getBytes("UTF-8"));
+    } catch (UnsupportedEncodingException e) { throw new RuntimeException(e); }
+  }
+
+  /** Returns {@link #fingerprint64} applied to the parsing canonical form
+    * of the supplied schema. */
+  public static long parsingFingerprint64(Schema s) {
+    try {
+      return fingerprint64(toParsingForm(s).getBytes("UTF-8"));
+    } catch (java.io.UnsupportedEncodingException e)
+      { throw new RuntimeException(e); }
+  }
+
+  private static Appendable build(Map<String,String> env, Schema s,
+                                  Appendable o) throws IOException {
+    boolean firstTime = true;
+    Schema.Type st = s.getType();
+    switch (st) {
+    default: // boolean, bytes, double, float, int, long, null, string
+      return o.append('"').append(st.getName()).append('"');
+
+    case UNION:
+      o.append('[');
+      for (Schema b: s.getTypes()) {
+        if (! firstTime) o.append(','); else firstTime = false;
+        build(env, b, o);
+      }
+      return o.append(']');
+
+    case ARRAY:  case MAP:
+      o.append("{\"type\":\"").append(st.getName()).append("\"");
+      if (st == Schema.Type.ARRAY)
+        build(env, s.getElementType(), o.append(",\"items\":"));
+      else build(env, s.getValueType(), o.append(",\"values\":"));
+      return o.append("}");
+
+    case ENUM: case FIXED: case RECORD:
+      String name = s.getFullName();
+      if (env.get(name) != null) return o.append(env.get(name));
+      String qname = "\""+name+"\"";
+      env.put(name, qname);
+      o.append("{\"name\":").append(qname);
+      o.append(",\"type\":\"").append(st.getName()).append("\"");
+      if (st == Schema.Type.ENUM) {
+        o.append(",\"symbols\":[");
+        for (String enumSymbol: s.getEnumSymbols()) {
+          if (! firstTime) o.append(','); else firstTime = false;
+          o.append('"').append(enumSymbol).append('"');
+        }
+        o.append("]");
+      } else if (st == Schema.Type.FIXED) {
+        o.append(",\"size\":").append(Integer.toString(s.getFixedSize()));
+      } else { // st == Schema.Type.RECORD
+        o.append(",\"fields\":[");
+        for (Schema.Field f: s.getFields()) {
+          if (! firstTime) o.append(','); else firstTime = false;
+          o.append("{\"name\":\"").append(f.name()).append("\"");
+          build(env, f.schema(), o.append(",\"type\":")).append("}");
+        }
+        o.append("]");
+      }
+      return o.append("}");
+    }
+  }
+
+  final static long EMPTY64 = 0xc15d213aa4d7a795L;
+  private static long[] fpTable64 = null;
+
+  private static void fillFPTable64() {
+    fpTable64 = new long[256];
+    for (int i = 0; i < 256; i++) {
+      long fp = i;
+      for (int j = 0; j < 8; j++) {
+        long mask = -(fp & 1L);
+        fp = (fp >>> 1) ^ (EMPTY64 & mask);
+      }
+      fpTable64[i] = fp;
+    }
+  }
+}

Propchange: avro/trunk/lang/java/avro/src/main/java/org/apache/avro/SchemaNormalization.java
------------------------------------------------------------------------------
    svn:eol-style = native

Added: avro/trunk/lang/java/avro/src/test/java/org/apache/avro/TestSchemaNormalization.java
URL: http://svn.apache.org/viewvc/avro/trunk/lang/java/avro/src/test/java/org/apache/avro/TestSchemaNormalization.java?rev=1298117&view=auto
==============================================================================
--- avro/trunk/lang/java/avro/src/test/java/org/apache/avro/TestSchemaNormalization.java (added)
+++ avro/trunk/lang/java/avro/src/test/java/org/apache/avro/TestSchemaNormalization.java Wed Mar  7 21:01:33 2012
@@ -0,0 +1,122 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.avro;
+
+import java.io.BufferedReader;
+import java.io.FileReader;
+import java.io.IOException;
+import java.util.List;
+import java.util.ArrayList;
+import java.util.Formatter;
+
+import org.junit.Test;
+import org.junit.runner.RunWith;
+import org.junit.runners.Parameterized;
+import org.junit.runners.Parameterized.Parameters;
+import org.junit.experimental.runners.Enclosed;
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertTrue;
+
+import org.apache.avro.util.CaseFinder;
+
+
+@RunWith(Enclosed.class)
+public class TestSchemaNormalization {
+
+  @RunWith(Parameterized.class)
+  public static class TestCanonical {
+    String input, expectedOutput;
+    public TestCanonical(String i, String o) { input=i; expectedOutput=o; }
+
+    @Parameters public static List<Object[]> cases() throws IOException
+    { return CaseFinder.find(data(), "canonical", new ArrayList<Object[]>()); }
+
+    @Test public void testCanonicalization() throws Exception {
+      assertEquals(SchemaNormalization.toParsingForm(Schema.parse(input)),
+                   expectedOutput);
+    }
+  }
+
+  @RunWith(Parameterized.class)
+  public static class TestFingerprint {
+    String input, expectedOutput;
+    public TestFingerprint(String i, String o) { input=i; expectedOutput=o; }
+
+    @Parameters public static List<Object[]> cases() throws IOException
+    { return CaseFinder.find(data(),"fingerprint",new ArrayList<Object[]>()); }
+
+    @Test public void testCanonicalization() throws Exception {
+      Schema s = Schema.parse(input);
+      long carefulFP = altFingerprint(SchemaNormalization.toParsingForm(s));
+      assertEquals(carefulFP, Long.parseLong(expectedOutput));
+      assertEqHex(carefulFP, SchemaNormalization.parsingFingerprint64(s));
+    }
+  }
+
+  private static String DATA_FILE =
+    (System.getProperty("share.dir", "../../../share")
+     + "/test/data/schema-tests.txt");
+
+  private static BufferedReader data() throws IOException
+  { return new BufferedReader(new FileReader(DATA_FILE)); }
+
+  /** Compute the fingerprint of the UTF-8 bytes of <i>s</i> using a slow
+      algorithm that's an alternative to the one implemented in {@link
+      SchemaNormalization}.  Algorithm from Broder93 ("Some applications of
+      Rabin's fingerprinting method"). */
+  public static long altFingerprint(String s) {
+    // In our algorithm, we multiply all inputs by x^64 (which is
+    // equivalent to prepending it with a single "1" bit followed
+    // by 64 zero bits).  This both deals with the fact that
+    // CRCs ignore leading zeros, and also ensures some degree of
+    // randomness for small inputs
+    try {
+      long tmp = altExtend(SchemaNormalization.EMPTY64, 64, ONE,
+                           s.getBytes("UTF-8"));
+      return altExtend(SchemaNormalization.EMPTY64, 64, tmp, POSTFIX);
+    } catch (java.io.UnsupportedEncodingException e)
+      { throw new RuntimeException(e); } 
+  }
+
+  private static long altExtend(long poly, int degree, long fp, byte[] b) {
+    final long overflowBit = 1L<<(64-degree);
+    for (int i = 0; i < b.length; i++) {
+      for (int j = 1; j < 129; j = j<<1) {
+        boolean overflow = (0 != (fp & overflowBit));
+        fp >>>= 1;
+        if (0 != (j&b[i])) fp |= ONE; // shift in the input bit
+        if (overflow) {
+          fp ^= poly; // hi-order coeff of poly kills overflow bit
+        }
+      }
+    }
+    return fp;
+  }
+
+  private static final long ONE = 0x8000000000000000L;
+  private static final byte[] POSTFIX = { 0, 0, 0, 0, 0, 0, 0, 0 };
+
+  private static void assertEqHex(long expected, long actual) {
+    String m = format("0x%016x != 0x%016x", expected, actual);
+    assertTrue(m, expected == actual);
+  }
+
+  private static String format(String f, Object... args) {
+    return (new Formatter()).format(f, args).toString();
+  }
+}

Propchange: avro/trunk/lang/java/avro/src/test/java/org/apache/avro/TestSchemaNormalization.java
------------------------------------------------------------------------------
    svn:eol-style = native

Added: avro/trunk/lang/java/avro/src/test/java/org/apache/avro/util/CaseFinder.java
URL: http://svn.apache.org/viewvc/avro/trunk/lang/java/avro/src/test/java/org/apache/avro/util/CaseFinder.java?rev=1298117&view=auto
==============================================================================
--- avro/trunk/lang/java/avro/src/test/java/org/apache/avro/util/CaseFinder.java (added)
+++ avro/trunk/lang/java/avro/src/test/java/org/apache/avro/util/CaseFinder.java Wed Mar  7 21:01:33 2012
@@ -0,0 +1,211 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.avro.util;
+
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.util.List;
+import java.util.regex.Pattern;
+import java.util.regex.Matcher;
+
+/** Parser for files containing test cases consisting of
+ * <code>&lt;String,String&gt;</code> pairs, where the first string is
+ * the input to the test case, and the second string is the expected
+ * output of the test case.
+ *
+ * <p> A test-case file is a sequence of <a
+ * href="http://en.wikipedia.org/wiki/Here_document">here documents</a>
+ * ("heredocs"), very similar in syntax to Unix Shell heredocs.
+ * Heredocs labeled "INPUT" indicate the start of a new case, and
+ * these INPUT heredocs define the inputs of test cases.  Following an
+ * "INPUT" heredoc can come zero or more "expected-output" heredocs.
+ * Each of these expected-output heredocs defines what we call a
+ * <dfn>subcase</dfn>.  The assumption here is that for each
+ * interesting test input, there are often multiple different tests
+ * one could run, each with different expected outputs.
+ *
+ * <p> Consumers of this class call the {@link #find} method to find
+ * all subcases marked with a given label.  For example, imagine the
+ * following test-case file:
+ * <blockquote> <pre>
+ *    &lt;&lt;INPUT 0
+ *    &lt;&lt;VALUE 0
+ *    &lt;&lt;PPRINT 0
+ *    &lt;&lt;INPUT 1+1
+ *    &lt;&lt;VALUE 2
+ *    &lt;&lt;PPRINT 1 + 1
+ *    &lt;&lt;SEXP (+ 1 1)
+ *    SEXP
+ * </pre> </blockquote>
+ * Calling {@link #find} on the label "VALUE" will return two test
+ * cases, the pair <code>&lt;"0","0"&gt;</code> and
+ * <code>&lt;"1+1","2"&gt;</code>.  Calling it on the label "PPRINT"
+ * will return <code>&lt;"0","0"&gt;</code> and <code>&lt;"1+1","1 +
+ * 1"&gt;</code>.  Notice that there need not be a subcase for every
+ * INPUT.  In the case of "SEXP", for example, {@link #find} will
+ * return only the single pair <code>&lt;"1+1","(+ 1 1)"&gt;</code>.
+ *
+ * <p> There are two forms of heredocs, single-line and multi-line.
+ * The examples above (except "SEXP") are single-line heredocs.  The
+ * general syntax for these is:
+ * <blockquote> <pre>
+ * ^&lt;&lt;([a-zA-Z][_a-zA-Z0-9]*) (.*)$
+ * </pre> </blockquote>
+ * The first group in this regex is the label of the heredoc, and the
+ * second group is the text of the heredoc.  A single space separates
+ * the two groups and is not part of the heredoc (subsequent spaces
+ * <em>will</em> be included in the heredoc).  A "line terminator" as
+ * defined by the Java language (i.e., CR, LF, or CR followed by LF)
+ * terminates a single-line heredoc but is not included in the text
+ * of the heredoc.
+ *
+ * <p> As the name implies, multi-line heredocs are spread across
+ * multiple lines, as in this example:
+ * <blockquote> <pre>
+ *    &lt;&lt;INPUT
+ *    1
+ *    +1 +
+ *    1
+ *    INPUT
+ *    &lt;&lt;VALUE 3
+ *    &lt;&lt;PPRINT 1 + 1 + 1
+ * </pre> </blockquote>
+ * In this case, the input to the test case is spread across multiple
+ * lines (the line terminators in these documents are preserved as
+ * part of the document text).  Multi-line heredocs can be used for
+ * both the inputs of test cases and their expected outputs.
+
+ * <p> The syntax of multi-line heredocs obeys the following pseudo-regex:
+ * <blockquote> <pre>
+ * ^&lt;&lt;([a-zA-Z][_a-zA-Z0-9]*)$(.*)$^\1$
+ * </pre> </blockquote>
+ * That is, as illustrated by the example, a multi-line heredoc named
+ * "LABEL" consists of the text <code>&lt;lt;LABEL</code> on a line by
+ * itself, followed by the text of the heredoc, followed by the text
+ * <code>LABEL</code> on a line by itself (if LABEL starts a line but
+ * is not the <em>only</em> text on that line, then that entire line
+ * is part of the heredoc, and the heredoc is not terminated by that
+ * line).
+ *
+ * <p>In multi-line heredocs, neither the line terminator that
+ * terminates the start of the document, nor the one just before the
+ * label that ends the heredoc, are part of the text of the heredoc.
+ * Thus, for example, the text of the multi-line input from above
+ * would be exactly <code>"1\n+1&nbsp;+\n1"</code>.  If you want a new
+ * line at the end of a multi-line heredoc, put a blank line before
+ * the label ending the heredoc.
+ *
+ * <p>Also in multi-line heredocs, line-terminators within the heredoc
+ * are normalized to line-feeds ('\n').  Thus, for example, when a
+ * test file written on a Windows machine is parsed on any machine,
+ * the Windows-style line terminators within heredocs will be
+ * translated to Unix-style line terminators, no matter what platform
+ * the tests are run on.
+ *
+ * <p> Note that lines between heredocs are ignored, and can be used
+ * to provide spacing between and/or commentary on the test cases.
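+ *
+ * <p>Typical usage (illustrative; <code>in</code> is a
+ * <code>BufferedReader</code> over a test-case file):
+ * <pre>
+ *   List&lt;Object[]&gt; cases =
+ *     CaseFinder.find(in, "VALUE", new ArrayList&lt;Object[]&gt;());
+ * </pre>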
+ */
+public class CaseFinder {
+  /** Scan test-case file <code>in</code> looking for test subcases
+    * marked with <code>label</code>.  Any such cases are appended
+    * (in order) to the <code>cases</code> parameter.  If <code>label</code>
+    * equals the string <code>"INPUT"</code>, then returns the list of
+    * &lt;<i>input</i>, <code>null</code>&gt; pairs for <i>input</i>
+    * equal to all heredocs named INPUT found in the input
+    * stream. */
+  public static List<Object[]> find(BufferedReader in, String label,
+                                    List<Object[]> cases)
+    throws IOException
+  {
+    if (! Pattern.matches(LABEL_REGEX, label))
+      throw new IllegalArgumentException("Bad case subcase label: " + label);
+
+    final String subcaseMarker = "<<" + label;
+
+    for (String line = in.readLine();;) {
+      // Find next new case
+      while (line != null && !line.startsWith(NEW_CASE_MARKER))
+        line = in.readLine();
+      if (line == null) break;
+      String input;
+      input = processHereDoc(in, line);
+
+      if (label.equals(NEW_CASE_NAME)) {
+        cases.add(new Object[] { input, null });
+        line = in.readLine();
+        continue;
+      }
+
+      // Check to see if there's a subcase named "label" for that case
+      do {
+        line = in.readLine();
+      } while (line != null && (!line.startsWith(NEW_CASE_MARKER)
+                                && !line.startsWith(subcaseMarker)));
+      if (line == null || line.startsWith(NEW_CASE_MARKER)) continue;
+      String expectedOutput = processHereDoc(in, line);
+
+      cases.add(new Object[] { input, expectedOutput });
+    }
+    in.close();
+    return cases;
+  }
+
+  private static final String NEW_CASE_NAME = "INPUT";
+  private static final String NEW_CASE_MARKER = "<<"+NEW_CASE_NAME;
+  private static final String LABEL_REGEX = "[a-zA-Z][_a-zA-Z0-9]*";
+  private static final Pattern START_LINE_PATTERN
+    = Pattern.compile("^<<("+LABEL_REGEX+")(.*)$");
+
+  /** Reads and returns the content of a heredoc.  Assumes we just read
+    * the start-of-heredoc marker line, which is passed in as
+    * <code>docStart</code>.  Normalizes arbitrary line terminators to
+    * line-feeds ('\n'), but strips the newline from the final line of
+    * the heredoc.  Throws IOException if EOF is reached before the
+    * heredoc is terminated. */
+  private static String processHereDoc(BufferedReader in, String docStart)
+    throws IOException
+  {
+    Matcher m = START_LINE_PATTERN.matcher(docStart);
+    if (! m.matches())
+      throw new IllegalArgumentException("Wasn't given the start of a heredoc (\""+docStart+"\")");
+    String docName = m.group(1);
+
+    // Determine if this is a single-line heredoc, and process if it is
+    String singleLineText = m.group(2);
+    if (singleLineText.length() != 0) {
+      if (! singleLineText.startsWith(" ")) 
+        throw new IOException("Single-line heredoc missing initial space (\""+docStart+"\")");
+      return singleLineText.substring(1);
+    }
+
+    // Process multi-line heredocs
+    StringBuilder result = new StringBuilder();
+    String line = in.readLine();
+    String prevLine = "";
+    boolean firstTime = true;
+    while (line != null && !line.equals(docName)) {
+      if (! firstTime) result.append(prevLine).append('\n');
+      else firstTime = false;
+      prevLine = line;
+      line = in.readLine();
+    }
+    if (line == null)
+      throw new IOException("Here document (" + docName
+                            + ") terminated by end-of-file.");
+    return result.append(prevLine).toString();
+  }
+}

Propchange: avro/trunk/lang/java/avro/src/test/java/org/apache/avro/util/CaseFinder.java
------------------------------------------------------------------------------
    svn:eol-style = native

Added: avro/trunk/lang/java/avro/src/test/java/org/apache/avro/util/TestCaseFinder.java
URL: http://svn.apache.org/viewvc/avro/trunk/lang/java/avro/src/test/java/org/apache/avro/util/TestCaseFinder.java?rev=1298117&view=auto
==============================================================================
--- avro/trunk/lang/java/avro/src/test/java/org/apache/avro/util/TestCaseFinder.java (added)
+++ avro/trunk/lang/java/avro/src/test/java/org/apache/avro/util/TestCaseFinder.java Wed Mar  7 21:01:33 2012
@@ -0,0 +1,141 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.avro.util;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+import java.io.BufferedReader;
+import java.io.StringReader;
+
+import org.junit.Test;
+import org.junit.runner.RunWith;
+import org.junit.runners.Parameterized;
+import org.junit.runners.Parameterized.Parameters;
+import org.junit.experimental.runners.Enclosed;
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertTrue;
+
+@RunWith(Enclosed.class)
+public class TestCaseFinder {
+
+  @RunWith(Parameterized.class)
+  public static class SimpleCases {
+    String input, label;
+    List<Object[]> expectedOutput;
+
+    public SimpleCases(String input, String label, Object[][] ex) {
+      this.input = input;
+      this.label = label;
+      this.expectedOutput = Arrays.asList(ex);
+    }
+
+    @Parameters
+    public static List<Object[]> cases() {
+      List<Object[]> result = new ArrayList<Object[]>();
+      result.add(new Object[] { "", "foo", new Object[][] { } });
+      result.add(new Object[] { "<<INPUT a\n<<OUTPUT b", "OUTPUT",
+                                new Object[][] { {"a","b"} } });
+      result.add(new Object[] { "<<INPUT a\n<<OUTPUT b\n", "OUTPUT",
+                                new Object[][] { {"a","b"} } });
+      result.add(new Object[] { "<<INPUT a\n<<OUTPUT b\n\n", "OUTPUT",
+                                new Object[][] { {"a","b"} } });
+      result.add(new Object[] { "<<INPUT a\r<<OUTPUT b", "OUTPUT",
+                                new Object[][] { {"a","b"} } });
+      result.add(new Object[] { "// This is a test\n<<INPUT a\n\n\n<<OUTPUT b",
+                                "OUTPUT",
+                                new Object[][] { {"a","b"} } });
+      result.add(new Object[] { "<<INPUT a\n<<OUTPUT\nb\nOUTPUT", "OUTPUT",
+                                new Object[][] { {"a","b"} } });
+      result.add(new Object[] { "<<INPUT a\n<<OUTPUT\nb\nOUTPUT", "OUTPUT",
+                                new Object[][] { {"a","b"} } });
+      result.add(new Object[] { "<<INPUT a\n<<OUTPUT\nb\n\nOUTPUT", "OUTPUT",
+                                new Object[][] { {"a","b\n"} } });
+      result.add(new Object[] { "<<INPUT a\n<<OUTPUT\n\n  b  \n\nOUTPUT",
+                                "OUTPUT",
+                                new Object[][] { {"a","\n  b  \n"} } });
+      result.add(new Object[] { "<<INPUT a\n<<O b\n<<INPUT c\n<<O d", "O",
+                                new Object[][] { {"a","b"}, {"c","d"} } });
+      result.add(new Object[] { "<<INPUT a\n<<O b\n<<F z\n<<INPUT c\n<<O d",
+                                "O",
+                                new Object[][] { {"a","b"}, {"c","d"} } });
+      result.add(new Object[] { "<<INPUT a\n<<O b\n<<F z\n<<INPUT c\n<<O d",
+                                "F",
+                                new Object[][] { {"a","z"} } });
+      result.add(new Object[] { "<<INPUT a\n<<O b\n<<F z\n<<INPUT\nc\nINPUT\n<<O d\n<<INPUT e",
+                                "INPUT",
+                                new Object[][] { {"a",null}, {"c",null}, {"e", null} } });
+      return result;
+    }
+
+    @Test public void testOutput() throws Exception {
+      List<Object[]> result = new ArrayList<Object[]>();
+      CaseFinder.find(mk(input), label, result);
+      assertTrue(pr(result), eq(result, expectedOutput));
+    }
+  }
+
+  public static class NonParameterized {
+    @Test (expected=java.lang.IllegalArgumentException.class)
+    public void testBadDocLabel1() throws Exception {
+      List<Object[]> result = new ArrayList<Object[]>();
+      CaseFinder.find(mk("<<INPUT blah"), "", result);
+    }
+
+    @Test (expected=java.lang.IllegalArgumentException.class)
+    public void testBadDocLabel2() throws Exception {
+      List<Object[]> result = new ArrayList<Object[]>();
+      CaseFinder.find(mk("<<INPUT blah"), "kill-er", result);
+    }
+
+    @Test (expected=java.io.IOException.class)
+    public void testBadSingleLineHeredoc() throws Exception {
+      List<Object[]> result = new ArrayList<Object[]>();
+      CaseFinder.find(mk("<<INPUTblah"), "foo", result);
+    }
+
+    @Test (expected=java.io.IOException.class)
+    public void testUnterminatedHeredoc() throws Exception {
+      List<Object[]> result = new ArrayList<Object[]>();
+      CaseFinder.find(mk("<<INPUT"), "foo", result);
+    }
+  }
+
+  private static BufferedReader mk(String s)
+  { return new BufferedReader(new StringReader(s)); }
+
+  private static String pr(List<Object[]> t) {
+    StringBuilder b = new StringBuilder();
+    b.append("{ ");
+    boolean firstTime = true;
+    for (Object[] p: t) {
+      if (! firstTime) b.append(", "); else firstTime = false;
+      b.append("{ \"").append(p[0]).append("\", \"")
+        .append(p[1]).append("\" }");
+    }
+    b.append("}");
+    return b.toString();
+  }
+
+  private static boolean eq(List<Object[]> l1, List<Object[]> l2) {
+    if (l1 == null || l2 == null) return l1 == l2;
+    if (l1.size() != l2.size()) return false;
+    for (int i = 0; i < l1.size(); i++)
+      if (! Arrays.equals(l1.get(i), l2.get(i))) return false;
+    return true;
+  }
+}

Propchange: avro/trunk/lang/java/avro/src/test/java/org/apache/avro/util/TestCaseFinder.java
------------------------------------------------------------------------------
    svn:eol-style = native

Added: avro/trunk/share/test/data/schema-tests.txt
URL: http://svn.apache.org/viewvc/avro/trunk/share/test/data/schema-tests.txt?rev=1298117&view=auto
==============================================================================
--- avro/trunk/share/test/data/schema-tests.txt (added)
+++ avro/trunk/share/test/data/schema-tests.txt Wed Mar  7 21:01:33 2012
@@ -0,0 +1,192 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+// NOTE: the Java implementation provides a slow-but-direct implementation
+// of the fingerprinting algorithm which is used to cross-check the
+// "fingerprint" values below.  Thus, the Java unit-tests provide validation
+// for these values, so other languages can just assume they are correct.
+
+
+// 000
+<<INPUT "null"
+<<canonical "null"
+<<fingerprint 7195948357588979594
+
+// 001
+<<INPUT {"type":"null"}
+<<canonical "null"
+
+// 002
+<<INPUT "boolean"
+<<canonical "boolean"
+<<fingerprint -6970731678124411036
+
+// 003
+<<INPUT {"type":"boolean"}
+<<canonical "boolean"
+
+// 004
+<<INPUT "int"
+<<canonical "int"
+<<fingerprint 8247732601305521295
+
+// 005
+<<INPUT {"type":"int"}
+<<canonical "int"
+
+// 006
+<<INPUT "long"
+<<canonical "long"
+<<fingerprint -3434872931120570953
+
+// 007
+<<INPUT {"type":"long"}
+<<canonical "long"
+
+// 008
+<<INPUT "float"
+<<canonical "float"
+<<fingerprint 5583340709985441680
+
+// 009
+<<INPUT {"type":"float"}
+<<canonical "float"
+
+// 010
+<<INPUT "double"
+<<canonical "double"
+<<fingerprint -8181574048448539266
+
+// 011
+<<INPUT {"type":"double"}
+<<canonical "double"
+
+// 012
+<<INPUT "bytes"
+<<canonical "bytes"
+<<fingerprint 5746618253357095269
+
+// 013
+<<INPUT {"type":"bytes"}
+<<canonical "bytes"
+
+// 014
+<<INPUT "string"
+<<canonical "string"
+<<fingerprint -8142146995180207161
+
+// 015
+<<INPUT {"type":"string"}
+<<canonical "string"
+
+// 016
+<<INPUT [  ]
+<<canonical []
+<<fingerprint -1241056759729112623
+
+// 017
+<<INPUT [ "int"  ]
+<<canonical ["int"]
+<<fingerprint -5232228896498058493
+
+// 018
+<<INPUT [ "int" , {"type":"boolean"} ]
+<<canonical ["int","boolean"]
+<<fingerprint 5392556393470105090
+
+// 019
+<<INPUT {"fields":[], "type":"record", "name":"foo"}
+<<canonical {"name":"foo","type":"record","fields":[]}
+<<fingerprint -4824392279771201922
+
+// 020
+<<INPUT {"fields":[], "type":"record", "name":"foo", "namespace":"x.y"}
+<<canonical {"name":"x.y.foo","type":"record","fields":[]}
+<<fingerprint 5916914534497305771
+
+// 021
+<<INPUT {"fields":[], "type":"record", "name":"a.b.foo", "namespace":"x.y"}
+<<canonical {"name":"a.b.foo","type":"record","fields":[]}
+<<fingerprint -4616218487480524110
+
+// 022
+<<INPUT {"fields":[], "type":"record", "name":"foo", "doc":"Useful info"}
+<<canonical {"name":"foo","type":"record","fields":[]}
+<<fingerprint -4824392279771201922
+
+// 023
+<<INPUT {"fields":[], "type":"record", "name":"foo", "aliases":["foo","bar"]}
+<<canonical {"name":"foo","type":"record","fields":[]}
+<<fingerprint -4824392279771201922
+
+// 024
+<<INPUT {"fields":[], "type":"record", "name":"foo", "doc":"foo", "aliases":["foo","bar"]}
+<<canonical {"name":"foo","type":"record","fields":[]}
+<<fingerprint -4824392279771201922
+
+// 025
+<<INPUT {"fields":[{"type":{"type":"boolean"}, "name":"f1"}], "type":"record", "name":"foo"}
+<<canonical {"name":"foo","type":"record","fields":[{"name":"f1","type":"boolean"}]}
+<<fingerprint 7843277075252814651
+
+// 026
+<<INPUT
+{ "fields":[{"type":"boolean", "aliases":[], "name":"f1", "default":"true"},
+            {"order":"descending","name":"f2","doc":"Hello","type":"int"}],
+  "type":"record", "name":"foo"
+}
+INPUT
+<<canonical {"name":"foo","type":"record","fields":[{"name":"f1","type":"boolean"},{"name":"f2","type":"int"}]}
+<<fingerprint -4860222112080293046
+
+// 027
+<<INPUT {"type":"enum", "name":"foo", "symbols":["A1"]}
+<<canonical {"name":"foo","type":"enum","symbols":["A1"]}
+<<fingerprint -6342190197741309591
+
+// 028
+<<INPUT {"namespace":"x.y.z", "type":"enum", "name":"foo", "doc":"foo bar", "symbols":["A1", "A2"]}
+<<canonical {"name":"x.y.z.foo","type":"enum","symbols":["A1","A2"]}
+<<fingerprint -4448647247586288245
+
+// 029
+// <<INPUT {"type":"fixed", "name":"foo", "size":015} -- Avro parser broken???
+<<INPUT {"name":"foo","type":"fixed","size":15}
+<<canonical {"name":"foo","type":"fixed","size":15}
+<<fingerprint 1756455273707447556
+
+// 030
+<<INPUT {"namespace":"x.y.z", "type":"fixed", "name":"foo", "doc":"foo bar", "size":32}
+<<canonical {"name":"x.y.z.foo","type":"fixed","size":32}
+<<fingerprint -3064184465700546786
+
+// 031
+<<INPUT { "items":{"type":"null"}, "type":"array"}
+<<canonical {"type":"array","items":"null"}
+<<fingerprint -589620603366471059
+
+// 032
+<<INPUT { "values":"string", "type":"map"}
+<<canonical {"type":"map","values":"string"}
+<<fingerprint -8732877298790414990
+
+// 033
+<<INPUT
+  {"name":"PigValue","type":"record",
+   "fields":[{"name":"value", "type":["null", "int", "long", "PigValue"]}]}
+INPUT
+<<canonical {"name":"PigValue","type":"record","fields":[{"name":"value","type":["null","int","long","PigValue"]}]}
+<<fingerprint -1759257747318642341

Propchange: avro/trunk/share/test/data/schema-tests.txt
------------------------------------------------------------------------------
    svn:eol-style = native


