avro-commits mailing list archives

From cutt...@apache.org
Subject svn commit: r1414978 - in /avro/trunk: CHANGES.txt lang/java/avro/src/main/java/org/apache/avro/io/parsing/doc-files/parsing.html
Date Wed, 28 Nov 2012 22:42:35 GMT
Author: cutting
Date: Wed Nov 28 22:42:34 2012
New Revision: 1414978

URL: http://svn.apache.org/viewvc?rev=1414978&view=rev
Log:
AVRO-1178. Java: Fix typos in parsing document. Contributed by Martin Kleppmann.

Modified:
    avro/trunk/CHANGES.txt
    avro/trunk/lang/java/avro/src/main/java/org/apache/avro/io/parsing/doc-files/parsing.html

Modified: avro/trunk/CHANGES.txt
URL: http://svn.apache.org/viewvc/avro/trunk/CHANGES.txt?rev=1414978&r1=1414977&r2=1414978&view=diff
==============================================================================
--- avro/trunk/CHANGES.txt (original)
+++ avro/trunk/CHANGES.txt Wed Nov 28 22:42:34 2012
@@ -39,6 +39,9 @@ Trunk (not yet released)
     AVRO-1210. Java: Fix mistakes in AvroMultipleOutputs error messages.
     (Dave Beech via cutting)
 
+    AVRO-1178. Java: Fix typos in parsing document.
+    (Martin Kleppmann via cutting)
+
   BUG FIXES
 
     AVRO-1171. Java: Don't call configure() twice on mappers & reducers.

Modified: avro/trunk/lang/java/avro/src/main/java/org/apache/avro/io/parsing/doc-files/parsing.html
URL: http://svn.apache.org/viewvc/avro/trunk/lang/java/avro/src/main/java/org/apache/avro/io/parsing/doc-files/parsing.html?rev=1414978&r1=1414977&r2=1414978&view=diff
==============================================================================
--- avro/trunk/lang/java/avro/src/main/java/org/apache/avro/io/parsing/doc-files/parsing.html (original)
+++ avro/trunk/lang/java/avro/src/main/java/org/apache/avro/io/parsing/doc-files/parsing.html Wed Nov 28 22:42:34 2012
@@ -24,7 +24,7 @@
 
 This document shows how an Avro schema can be interpreted as the definition of a context-free
grammar in LL(1).  We use such an interpretation for two use-cases.  In one use-case, we use
them to validate readers and writers of data against a single Avro schema.  Specifically,
sequences of <code>Encoder.writeXyz</code> methods can be validated against a
schema, and similarly sequences of <code>Decoder.readXyz</code> methods can be
validated against a schema.
 
-The second use-case is using grammars to perform schema resolution.  For this use-case, we've
developed a subclass of <code>Decoder</code> which takes two Avro schemas as input
-- a reader and a writer schema.  This subclass accepts an input stream written according
to the writer schema, and presents it to a client expecting the reader schema.  If the writer
writes a long, for example, where the reader expects a double, then the <code>Decoder.readDoubl</code>
method will convert the writer's long into a double.
+The second use-case is using grammars to perform schema resolution.  For this use-case, we've
developed a subclass of <code>Decoder</code> which takes two Avro schemas as input
-- a reader and a writer schema.  This subclass accepts an input stream written according
to the writer schema, and presents it to a client expecting the reader schema.  If the writer
writes a long, for example, where the reader expects a double, then the <code>Decoder.readDouble</code>
method will convert the writer's long into a double.
 
 This document looks at grammars in the context of these two use-cases.  We first look at
the single-schema case, then the double-schema case.  In the future, we believe the interpretation
of Avro schemas as CFGs will find other uses (for example, to determine whether or not a schema
admits finite-sized values).
 
@@ -33,7 +33,7 @@ This document looks at grammars in the c
 
 <p> We parse a schema into a set of JSON objects.  For each record, map, array, union
schema inside this set, this parse is going to generate a unique identifier "n<sub>i</sub>"
(the "pointer" to the schema).  By convention, n<sub>0</sub> is the identifier
for the "top-level" schema (i.e., the schema we want to read or write).  In addition, where
n<sub>i</sub> is a union, the parse will generate a unique identifier "b<sub>ij</sub>"
for each branch of the union.
 
-<p> A context-free grammar (CFG) consists of a set of terminal-symbols, a set of non-terminal
symbols, a set of productions, and a start symbol.  Here's how we interpret an Avro schema
as a CFG:
+<p> A context-free grammar (CFG) consists of a set of terminal symbols, a set of non-terminal
symbols, a set of productions, and a start symbol.  Here's how we interpret an Avro schema
as a CFG:
 
 <p> <b>Terminal symbols:</b> The terminal symbols of the CFG consist of
<code>null</code>, <code>bool</code>, <code>int</code>,
<code>long</code>, <code>float</code>, <code>double</code>,
<code>string</code>, <code>bytes</code>, <code>enum</code>,
<code>fixed</code>, <code>arraystart</code>, <code>arrayend</code>,
<code>mapstart</code>, <code>mapend</code>, and <code>union</code>.
 In addition, we define the special terminals <code>"1"</code>, <code>"2"</code>,
<code>"3"</code>, <code>...</code> which designate the "tag" of a
union (i.e., which branch of the union is actually being written or was found in the data).
 
@@ -206,7 +206,7 @@ Note that <code>T</code> is defined as <
 
 <p>The first section ("The interpretation") informally describes the grammer generated
by an Avro schema.  This section provides a more formal description using a set of induction
rules.  The earlier description in section one is fine for describing how a single Avro schema
generates a grammar.  But soon we're going to describe how two schemas together define a "resolving"
grammar, and for that description we'll need the more formal mechanism described here.
 
-<p>The terminal and non-terminal symbols in our grammar are as described in the first
section.  Our induction rules will define a function "C(S)=&lt;G,a&gt;", which takes
an Avro schema "S" and returns a pair consisting of a set of productions "X" and a symbol
"a".  This symbol "a" -- which is either a terminal, or a non-terminal defined by G -- generates
the values described by schema S.
+<p>The terminal and non-terminal symbols in our grammar are as described in the first
section.  Our induction rules will define a function "C(S)=&lt;G,a&gt;", which takes
an Avro schema "S" and returns a pair consisting of a set of productions "G" and a symbol
"a".  This symbol "a" -- which is either a terminal, or a non-terminal defined by G -- generates
the values described by schema S.
 
 <p>The first rule applies to all Avro primitive types:
 
@@ -223,14 +223,14 @@ Note that <code>T</code> is defined as <
 <table align=center>
   <tr><td align=center>
   <table cellspacing=0 cellpadding=0><tr><td>S=</td><td><code>{"type":"record",
"name":</code>a<code>,</code></td></tr>
-         <tr><td></td><td><code>"fields":[{"name":</code>F<sub>1</sub><code>,
"type":</code>S<sub>1</sub><code>},</code>...<code>, {"name":</code>F<sub>n</sub><code>,
"type":</code>S<sub>n</sub><code>}]}</code></td></tr></table></td></tr>
+         <tr><td></td><td><code>"fields":[{"name":</code>F<sub>1</sub><code>,
"type":</code>S<sub>1</sub><code>}, ..., {"name":</code>F<sub>n</sub><code>,
"type":</code>S<sub>n</sub><code>}]}</code></td></tr></table></td></tr>
   <tr align=center><td>C(S<sub>j</sub>)=&lt;G<sub>j</sub>,
f<sub>j</sub>&gt;</td></tr>
   <tr align=center><td><hr></td></tr>
   <tr align=center><td>C(S)=&lt;G<sub>1</sub> &#8746; ...
&#8746; G<sub>n</sub> &#8746; {a::=f<sub>1</sub> f<sub>2</sub>
... f<sub>n</sub>}, a&gt;</td></tr>
 </tr>
 </table>
 
-<p>In this case, the set of output-productions consists of all the productions generated
by the element-types of the record, plus a production that defines the non-terminal "n" to
be the sequence of field-types.  We return "n"as the grammar symbol representing this record-schema.
+<p>In this case, the set of output-productions consists of all the productions generated
by the element-types of the record, plus a production that defines the non-terminal "a" to
be the sequence of field-types.  We return "a" as the grammar symbol representing this record-schema.
 
 <p>Next, we define the rule for arrays:
 
@@ -241,7 +241,7 @@ Note that <code>T</code> is defined as <
   <tr align=center><td>C(S)=&lt;G<sub>e</sub> &#8746; {r
::= e r, r ::= &#949;, a ::= <code>arraystart</code> r <code>arrayend</code>},
a&gt;</td></tr>
 </table>
 
-<p>For arrays, the set of output productions again contains all productions generated
by the element-type.  In addition, we define <em>two</em> productions for "r",
which represents the repetition of this element type.  The first production is the recursive
case, which consists of the element-type followed by "r" all over again.  The next case is
the base case, which is the empty production.  Having defined this repetition, we can then
define "n" as this repetation bracketed by the terminal symbols <code>arraystart</code>
and <code>arrayend</code>.
+<p>For arrays, the set of output productions again contains all productions generated
by the element-type.  In addition, we define <em>two</em> productions for "r",
which represents the repetition of this element type.  The first production is the recursive
case, which consists of the element-type followed by "r" all over again.  The next case is
the base case, which is the empty production.  Having defined this repetition, we can then
define "a" as this repetition bracketed by the terminal symbols <code>arraystart</code>
and <code>arrayend</code>.
 
 <p>The rule for maps is almost identical to that for arrays:
 
@@ -257,7 +257,7 @@ Note that <code>T</code> is defined as <
 <p>The rule for unions:
 <table align=center>
 <tr align=center>
- <td>S=[S<sub>1</sub>, S<sub>2</sub><code>, ..., S<sub>n</sub>]</td>
+<td>S=<code>[S<sub>1</sub>, S<sub>2</sub>, ..., S<sub>n</sub>]</code></td>
 </tr>
 <tr align=center>
  <td>C(S<sub>j</sub>)=&lt;G<sub>j</sub>, b<sub>j</sub>&gt;</td>
@@ -266,30 +266,30 @@ Note that <code>T</code> is defined as <
 <tr align=center><td>C(S)=&lt;G<sub>1</sub> &#8746; ... &#8746;
G<sub>n</sub> &#8746; {u::=1 b<sub>1</sub>, u::=2 b<sub>2</sub>,
..., u::=n b<sub>n</sub>, a::=<code>union</code> u}, a&gt;</td></tr>
 </table>
 
-<p>In this rule, we again accumulate productions (G<sub>j</sub>)generated
by each of the sub-schemas contained by the top-level schemas.  If there are "k" branches,
we define "k" different productions for the non-terminal symbol "u", one for each branch in
the union.  These per-branch productions consist of the index of the branch (1 for the first
branch, 2 for the second, and so-forth), followed by the symbol representing the schema of
that branch.  With these productions for "u" defined, we can define "n" as simply the terminal-symbol
<code>union</code> followed by this non-terminal "u".
+<p>In this rule, we again accumulate productions (G<sub>j</sub>) generated
by each of the sub-schemas for each branch of the union.  If there are "k" branches, we define
"k" different productions for the non-terminal symbol "u", one for each branch in the union.
 These per-branch productions consist of the index of the branch (1 for the first branch,
2 for the second, and so forth), followed by the symbol representing the schema of that branch.
 With these productions for "u" defined, we can define "a" as simply the terminal symbol <code>union</code>
followed by this non-terminal "u".
 
 
 <p>The rule for fixed size binaries:
 <table align=center>
 <tr align=center>
- <td>S=<code>{"type"="fixed", "name"=a, "size"=s}</code></td>
+ <td>S=<code>{"type":"fixed", "name":a, "size":s}</code></td>
 </tr>
 <tr align=center><td><hr></td></tr>
 <tr align=center><td>C(S)=&lt;{a::=<code>fixed</code> f, f::=&#949;},
a&gt;</td></tr>
 </table>
 
-<p>In this rule, we define a new non-termial f which has associated size of the fixed-binary.
+<p>In this rule, we define a new non-terminal f which has associated size of the fixed-binary.
 
 <p>The rule for enums:
 <table align=center>
 <tr align=center>
- <td>S=<code>{"type"="enum", "name"=a, "symbols"=["s1", "s2", "s3", ...]}</code></td>
+ <td>S=<code>{"type":"enum", "name":a, "symbols":["s1", "s2", "s3", ...]}</code></td>
 </tr>
 <tr align=center><td><hr></td></tr>
 <tr align=center><td>C(S)=&lt;{a::=<code>enum</code> e, e::=&#949;},
a&gt;</td></tr>
 </table>
 
-<p>In this rule, we define a new non-termial f which has associated range of values.
+<p>In this rule, we define a new non-terminal e which has associated range of values.
 
 <h1>Resolution using action symbols</h1>
 
@@ -308,12 +308,12 @@ We want to use grammars to represent Avr
 
 <p> <li> <b>Enum actions:</b> when we have reader- and writer-schema
has enumerations, enum actions are used to map the writer's numerical value to the reader's
numeric value.
 
-<p> <li> <b>Error actions:</b> in general, errors in schema-resolution
can only be detected when data is being read.  For example, if the writer writers a <code>[long,&nbsp;string]</code>
union, and the reader is expecting just a <code>long</code>, an error is only
reported when the writer sends a string rather than a long.  Further, the Avro spec recommends
that <em>all</em> errors be detected at reading-time, even if they could be detected
earlier.  Error actions support the deferral of errors.
+<p> <li> <b>Error actions:</b> in general, errors in schema-resolution
can only be detected when data is being read.  For example, if the writer writes a <code>[long,&nbsp;string]</code>
union, and the reader is expecting just a <code>long</code>, an error is only
reported when the writer sends a string rather than a long.  Further, the Avro spec recommends
that <em>all</em> errors be detected at reading-time, even if they could be detected
earlier.  Error actions support the deferral of errors.
 </ul>
 
 <p>These actions will become "action symbols" in our grammar.  Action symbols are symbols
that cause our parser to perform special activities when they appear on the top of the parsing
stack.  For example, when the skip-action makes it to the top of the stack, the parser will
automatically skip the next value in the input stream.  (Again, Fischer and LeBlanc has a
nice description of action symbols.)
 
-<p>We're going to use induction rules to define a grammar.  This time, our induction
rules will define a two-argument function "C(W,R)=&lt;G,a&gt;", which takes two schema,
the writer's and reader's schemas respectively.  The results of this function the same as
they where for the single-schema case.
+<p>We're going to use induction rules to define a grammar.  This time, our induction
rules will define a two-argument function "C(W,R)=&lt;G,a&gt;", which takes two schema,
the writer's and reader's schemas respectively.  The results of this function are the same
as they were for the single-schema case.
 
 <p>The first rule applies to all Avro primitive types:
 
@@ -337,7 +337,7 @@ We want to use grammars to represent Avr
 
 <p> When this parameterized action is encountered, the parser will resolve the writer's
value into the reader's expected-type for that value.  In the parsing loop, when we encounter
this symbol, we use the "r" parameter of this symbol to check that the reader is asking for
the right type of value, and we use the "w" parameter to figure out how to parse the data
in the input stream.
 
-<p>On final possibility for pimitive types are incompatible types:
+<p>One final possibility for primitive types is that they are incompatible types:
 
 <table align=center>
   <tr align=center><td>The w,r pair does not fit the previous two rules, AND
neither</td></tr>
@@ -347,21 +347,21 @@ We want to use grammars to represent Avr
   <tr align=center><td>C(w,r)=&lt;{}, ErrorAction&gt;</td></tr>
 </table>
 
-<p> When this parameterized action is encountered, the parser will throw an error.
 Keep in mind that this symbol might be generated in the middle of a recursive call to "G."
 For example, if the reader's schema is long, and the writers is [long,&nbsp;string],
we'll generate an error symbol for the string-branch of the union; if this branch is occurred
in actual input, an error will then be generated.
+<p> When this parameterized action is encountered, the parser will throw an error.
 Keep in mind that this symbol might be generated in the middle of a recursive call to "G."
 For example, if the reader's schema is long, and the writer's is [long,&nbsp;string],
we'll generate an error symbol for the string-branch of the union; if this branch is occurred
in actual input, an error will then be generated.
 
-<p>The next rule deals with resolution fixed size binaries:
+<p>The next rule deals with resolution of fixed size binaries:
 
 <table align=center>
-  <tr align=center><td>w = {"type"="fixed", "name":"n1", "size"=s1}</td></tr>
-  <tr align=center><td>r = {"type"="fixed", "name":"n2", "size"=s2}</td></tr>
+  <tr align=center><td>w = <code>{"type":"fixed", "name":"n1", "size":s1}</code></td></tr>
+  <tr align=center><td>r = <code>{"type":"fixed", "name":"n2", "size":s2}</code></td></tr>
   <tr align=center><td>n1 != n2 or s1 != s2</td></tr>
   <tr><td><hr></td></tr>
   <tr align=center><td>C(w,r)=&lt;{}, ErrorAction&gt;</td></tr>
 </table>
 
 <table align=center>
-  <tr align=center><td>w = {"type"="fixed", "name":"n1", "size"=s1}</td></tr>
-  <tr align=center><td>r = {"type"="fixed", "name":"n2", "size"=s2}</td></tr>
+  <tr align=center><td>w = <code>{"type":"fixed", "name":"n1", "size":s1}</code></td></tr>
+  <tr align=center><td>r = <code>{"type":"fixed", "name":"n2", "size":s2}</code></td></tr>
   <tr align=center><td>n1 == n2 and s1 == s2</td></tr>
   <tr><td><hr></td></tr>
   <tr align=center><td>C(w,r)=&lt;{ a::=<code>fixed</code> f,
f::=&#949;}, a&gt;</td></tr>
@@ -369,11 +369,11 @@ We want to use grammars to represent Avr
 
 If the names are identical and sizes are identical, then we match otherwise an error is generated.
 
-<p>The next rule deals with resolution enums:
+<p>The next rule deals with resolution of enums:
 
 <table align=center>
-  <tr align=center><td>w = {"type"="enum", "symbols":[sw<sub>1</sub>,
sw<sub>2</sub>, ..., sw<sub>m</sub>] }</td></tr>
-  <tr align=center><td>r = {"type"="enum", "symbols":[sr<sub>1</sub>,
sr<sub>2</sub>, ..., sr<sub>n</sub>] }</td></tr>
+  <tr align=center><td>w = <code>{"type":"enum", "symbols":[sw<sub>1</sub>,
sw<sub>2</sub>, ..., sw<sub>m</sub>] }</code></td></tr>
+  <tr align=center><td>r = <code>{"type":"enum", "symbols":[sr<sub>1</sub>,
sr<sub>2</sub>, ..., sr<sub>n</sub>] }</code></td></tr>
   <tr align=center><td>f<sub>i</sub> = EnumAction(i, j) if sw<sub>i</sub>
== sr<sub>j</sub></td></tr>
   <tr align=center><td>f<sub>i</sub> = ErrorAction if sw<sub>i</sub>
does not match any sr<sub>j</sub></td></tr>
   <tr><td><hr></td></tr>
@@ -456,11 +456,11 @@ The symbol e has the set of actions f<su
 
 <p>The substance of this rule lies in the definion of the "f'<sub>j</sub>".
 If the writer's field F<sub>j</sub> is not a member of the reader's schema, then
a skip-action is generated, which will cause the parser to automatically skip over the field
without the reader knowing.  (In this case, note that we use the <em>single</em>-argument
version of "C", i.e., the version defined in the previous section!)
 
-If the wrtier's field F<sub>j</sub> <em>is</em> a member f the reader's
schema, then "f'<sub>j</sub>" is a two-symbol sequence: the first symbol is a
(parameterized) field-action which is used to tell the reader which of it's own fields is
coming next, followed by the symbol for parsing the value written by the writer.
+If the writer's field F<sub>j</sub> <em>is</em> a member f the reader's
schema, then "f'<sub>j</sub>" is a two-symbol sequence: the first symbol is a
(parameterized) field-action which is used to tell the reader which of its own fields is coming
next, followed by the symbol for parsing the value written by the writer.
 
 <p>The above rule for records works only when the reader and writer have the same name,
and the reader's fields are subset of the writer's.  In other cases, an error is producted.
 
-<p> The rule for arrays is straight forward:
+<p>The rule for arrays is straightforward:
 
 <table align=center>
 <tr align=center>
@@ -473,7 +473,7 @@ If the wrtier's field F<sub>j</sub> <em>
  <td>C(S<sub>w</sub>, S<sub>r</sub>)=&lt;G<sub>e</sub>,e&gt;
 </tr>
 <tr><td><hr></td></tr>
-<tr align=center><td>C(W,R)=&lt;G<sub>e</sub> U {r ::= e r, r
::= &#949;, a ::= <code>arraystart</code> r <code>arrayend}, a&gt;</td></tr>
+<tr align=center><td>C(W,R)=&lt;G<sub>e</sub> &#8746; {r
::= e r, r ::= &#949;, a ::= <code>arraystart</code> r <code>arrayend},
a&gt;</td></tr>
 </table>
 
 <p>Here the rule is largely the same as for the single-schema case, although the recursive
use of G may result in productions that are very different.  The rule for maps changes in
a similarly-small way, so we don't bother to detail that case in this document.
@@ -522,7 +522,7 @@ If the wrtier's field F<sub>j</sub> <em>
  <td>R=[R<sub>1</sub>, ..., R<sub>n</sub>]</td>
 </tr>
 <tr><td align=center>Branch "j" of R is the best match for W</td></tr>
-<tr><td align=center>C(W,R<sub>j</sub>)=&lt;&nbsp;G,w&gt;</td></tr>
+<tr><td align=center>C(W,R<sub>j</sub>)=&lt;G,w&gt;</td></tr>
 <tr><td><hr></td></tr>
 <tr><td align=center>C(W,R)=&lt;G, ReaderUnionAction(j,w)&gt;</td></tr>
 </table>
@@ -589,7 +589,7 @@ Here's a stylized version of the actual 
           else, T(X,t) is undefined, so throw an error;
 
       X = stack.pop();
-    }
+
     // We've left the loop, so X is a terminal symbol:
     case X:
       ResolvingTable(w,r):
@@ -611,5 +611,5 @@ Here's a stylized version of the actual 
       
     // Fall-through case:
     if (X == t) then return X
-    else throw an aerror 
+    else throw an error
 </pre>
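
Reader's note on the patched document: the single-schema induction rules C(S)=<G,a> quoted in the hunks above (record, array, union, fixed, enum) can be read as one small recursive function. The Python sketch below is an illustration only and is not taken from the Avro code base. The names `compile_schema`, `fresh`, `PRIMITIVES` and `EPSILON` are hypothetical, and so are the generated symbol names such as `r0` or `n1`. The primitive rule is not visible in this diff, so the sketch assumes it returns an empty production set together with the terminal itself. Maps and references to named types are left out because their rules are not shown here either.

```python
import itertools

# Avro primitive type names mapped to the document's terminal symbols
# (the document writes "bool" for Avro's "boolean").
PRIMITIVES = {"null": "null", "boolean": "bool", "int": "int", "long": "long",
              "float": "float", "double": "double", "string": "string",
              "bytes": "bytes"}

EPSILON = ()  # the empty right-hand side


def compile_schema(schema, fresh=None):
    """Return (G, a): a set of productions (lhs, rhs) and the symbol for schema."""
    if fresh is None:
        counter = itertools.count()
        # Hypothetical helper that hands out unused non-terminal names.
        fresh = lambda prefix: f"{prefix}{next(counter)}"

    if isinstance(schema, str):
        # Assumed primitive rule: no productions, the terminal stands for itself.
        return set(), PRIMITIVES[schema]

    if isinstance(schema, list):
        # Union: u ::= j b_j for each branch j, and a ::= union u.
        u, a = fresh("u"), fresh("n")
        productions = {(a, ("union", u))}
        for tag, branch in enumerate(schema, start=1):
            g_j, b_j = compile_schema(branch, fresh)
            productions |= g_j
            productions.add((u, (str(tag), b_j)))
        return productions, a

    kind = schema["type"]
    if kind in PRIMITIVES:
        return set(), PRIMITIVES[kind]
    if kind == "record":
        # a ::= f_1 f_2 ... f_n, plus everything the field types generate.
        a, productions, symbols = schema["name"], set(), []
        for field in schema["fields"]:
            g_j, f_j = compile_schema(field["type"], fresh)
            productions |= g_j
            symbols.append(f_j)
        productions.add((a, tuple(symbols)))
        return productions, a
    if kind == "array":
        # r ::= e r, r ::= epsilon, a ::= arraystart r arrayend.
        g_e, e = compile_schema(schema["items"], fresh)
        r, a = fresh("r"), fresh("n")
        return g_e | {(r, (e, r)), (r, EPSILON),
                      (a, ("arraystart", r, "arrayend"))}, a
    if kind in ("fixed", "enum"):
        # a ::= fixed f, f ::= epsilon (and likewise for enum with e).
        a, x = schema["name"], fresh("f" if kind == "fixed" else "e")
        return {(a, (kind, x)), (x, EPSILON)}, a
    raise ValueError(f"schema kind not covered by this sketch: {kind!r}")
```

Calling `compile_schema` on a record with an `int` field and an array-of-`string` field yields one production for the record, two for the repetition symbol and one for the array itself, which matches the way the rules accumulate the sets G with set union.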


