avro-dev mailing list archives

From "Scott Carey (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AVRO-859) Java: Data Flow Overhaul -- Composition and Symmetry
Date Thu, 14 Jul 2011 17:30:59 GMT

    [ https://issues.apache.org/jira/browse/AVRO-859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065402#comment-13065402
] 

Scott Carey commented on AVRO-859:
----------------------------------

h4. Functional Composition
All read and write operations can be broken into functional bits and composed rather than
writing monolithic classes.  This allows a "DatumWriter2" to be a graph of functions that
pre-compute all state required from a schema rather than traverse a schema for each write.
 Additionally, if the functions all belong to a common set of types, it becomes easy to use code
generation: either directly, or by parsing the resulting function graph and converting it to
code that the JVM can better optimize.

h4. Symmetry
Avro's data flow can be made symmetric.  Rather than thinking in terms of Read and Write,
think in terms of:
* _*Source*_: Where data that is represented by an Avro schema comes from -- this may be a
Decoder, or an Object graph.
* _*Target*_: Where data that represents an Avro schema is sent -- this may be an Encoder
or an Object graph.

Combine the two ideas and you can create _*Flows*_ -- the combination of a Source and a
Target for a specific Schema (or resolvable Schema pair).
The machinery that traverses and resolves schemas can be written once -- a single "DatumReader"
equivalent -- with different sources and targets combined to make different tools:
* A Decoder source + GenericData target = GenericDatumReader
* A SpecificData source + Encoder target = SpecificDatumWriter
* A BinaryDecoder source + JsonEncoder target = transform from binary to JSON without any intermediate
objects!
* A SpecificData source + GenericData target = transform one object type to another

Add in new sources and targets (Pig, ProtoBuf, Thrift objects; Pig binary, Protobuf binary,
Thrift binary) and you can mix/match more transformation tasks.
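To sketch the mix-and-match idea in Java: the `Source`/`Target` interfaces and the `flow` helper below are hypothetical stand-ins of my own, not the proposed API, with `java.util.function` types playing the roles of Decoder/Encoder/object graphs:

```java
// Hypothetical sketch, not the proposed API: a Source produces a value for a
// schema position, a Target consumes it, and the generic machinery that pairs
// them is written exactly once.
import java.util.function.Consumer;
import java.util.function.Supplier;

public class MixAndMatch {
  interface Source<T> extends Supplier<T> {}   // stand-in for Decoder / object graph
  interface Target<T> extends Consumer<T> {}   // stand-in for Encoder / object graph

  // One generic pairing routine; any source and any target plug in freely.
  static <T> void flow(Source<T> src, Target<T> tgt) {
    tgt.accept(src.get());
  }

  public static void main(String[] args) {
    StringBuilder json = new StringBuilder();
    Source<Integer> binaryish = () -> 42;                    // "BinaryDecoder" stand-in
    Target<Integer> jsonish =
        i -> json.append("{\"f\": ").append(i).append("}"); // "JsonEncoder" stand-in
    flow(binaryish, jsonish);  // binary -> json with no intermediate objects
    System.out.println(json);  // {"f": 42}
  }
}
```

Adding a new transformation is then just a new Source or Target implementation; the pairing routine never changes.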

Additionally, one can write a generic Equals/Compare implementation that takes two Sources
and either compares them or checks them for equality.  Then, you can compare binary data with
an object, or two objects with each other.
Data flow could also tee: one source with many targets.
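A generic check along these lines might look like the following sketch (hypothetical names; a plain `Supplier` stands in for a Source):

```java
// Hypothetical sketch of a generic equality check over two Sources: one side
// could be decoding binary while the other walks an object graph, and neither
// knows what the other is.
import java.util.Objects;
import java.util.function.Supplier;

public class SourceEquals {
  static <T> boolean sourcesEqual(Supplier<T> a, Supplier<T> b) {
    return Objects.equals(a.get(), b.get());
  }

  public static void main(String[] args) {
    final int[] decodedField = {7};   // stand-in for a Decoder-backed source
    final Integer objectField = 7;    // stand-in for an object-graph source
    System.out.println(sourcesEqual(() -> decodedField[0], () -> objectField)); // true
  }
}
```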



h4. Functional units
After much prototyping and design, I have identified that all Avro data flow can be done
by the composition of two functors:
The Unary Functor, which I have named *Access*: 
{code}
Access<A,B> {
 B access(A a);
}
{code}
And the Binary Functor, which I have named *Flow*:
{code}
Flow<A,B> {
 B flow(A a, B b);
}
{code}
In most cases, you can replace "A" with "FROM" and "B" with "TO" in relation to the Source
and Target concepts.  These functions compose naturally in all the ways required for data
to flow from a source to a target.
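The composition itself can be sketched in Java using the two interfaces above (the `compose` helper and the example ops are my own illustrative additions, not part of the proposal):

```java
public class Functors {
  // Unary functor: pull a B out of an A (e.g. read a field from a record).
  interface Access<A, B> {
    B access(A a);
  }

  // Binary functor: push an A into a B and return the (possibly new) B
  // (e.g. write a value to an encoder, or put it into a record).
  interface Flow<A, B> {
    B flow(A a, B b);
  }

  // "M access(A)" followed by "B flow(M, B)" equals "B flow(A, B)".
  static <A, M, B> Flow<A, B> compose(Access<A, M> first, Flow<M, B> second) {
    return (a, b) -> second.flow(first.access(a), b);
  }

  public static void main(String[] args) {
    Access<int[], Integer> getField = r -> r[0];                       // source-side op
    Flow<Integer, StringBuilder> writeInt = (i, out) -> out.append(i); // target-side op
    Flow<int[], StringBuilder> intField = compose(getField, writeInt);
    System.out.println(intField.flow(new int[]{5}, new StringBuilder())); // 5
  }
}
```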

h4. Making Symmetry
Consider this simple example, a Flow over the schema: 
{code}
{"type": "record", "name": "Foo", "fields":
  [{"name": "f1", "type": "int"}]
}
{code}

In the current implementation, a GenericDatumReader has the following API:
{code}
D read(D reuse, Decoder in);
{code}
which internally parses a Schema step by step, recursively calling methods with a similar
signature.
When we get to the leaf field, we return an integer, and on return insert that into a GenericData.Record
as the first field.
A very similar process occurs with GenericDatumWriter:
{code}
void write(D datum, Encoder out);
{code}
which traverses a schema, recursively calling methods with a similar signature.
On the way down the schema graph, we access objects and pass portions of the data through,
and when we hit the leaf field, we write it to the encoder and return.

Consider the innermost operation for both of the above:
Fetch an integer, then put it somewhere:
|| step || Source || Target || Source op || Target op || flow signature ||
| read an integer | IndexedRecord | Encoder | IndexedRecord.get() | (null) | int access(IndexedRecord) |
| read an integer | Decoder | IndexedRecord | Decoder.readInt() | (null) | int access(Decoder) |
| send integer to output | IndexedRecord | Encoder | (null) | Encoder.writeInt() | Encoder flow(int, Encoder) |
| send integer to output | Decoder | IndexedRecord | (null) | IndexedRecord.put() | IndexedRecord flow(int, IndexedRecord) |

The access and flow signatures compose as follows:
{code}
int access(A);
  followed by
B flow(int, B);
  equals
B flow(A, B);
{code}

So the above two examples compose to:
|| step || Source || Target || Source op || Target op || flow signature ||
| int flow | IndexedRecord | Encoder | IndexedRecord.get() | Encoder.writeInt() | Encoder flow(IndexedRecord, Encoder) |
| int flow | Decoder | IndexedRecord | Decoder.readInt() | IndexedRecord.put() | IndexedRecord flow(Decoder, IndexedRecord) |

As can be seen, for an integer field one can compose two functions, one provided by the Source
and one provided by the Target, and produce a Flow of data between them.
The source and target each have their own contexts -- the object types that an integer field
represents -- but do not have to know anything about the other side.  The flow composition
also needs no information about the source or target -- they meet only at "int".
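Both rows of the table above come out of the same compose step. Here is a hedged stand-in version, where an `int[]` plays IndexedRecord, a `StringBuilder` plays the Encoder, and an `Iterator` plays the Decoder (all illustrative substitutes, not the real Avro types):

```java
// Hypothetical sketch: the same compose step yields both the "write" flow
// (record -> encoder) and the "read" flow (decoder -> record).
import java.util.Iterator;
import java.util.List;

public class SymmetricFlows {
  interface Access<A, B> { B access(A a); }
  interface Flow<A, B> { B flow(A a, B b); }

  static <A, M, B> Flow<A, B> compose(Access<A, M> get, Flow<M, B> put) {
    return (a, b) -> put.flow(get.access(a), b);
  }

  public static void main(String[] args) {
    // "Write" direction: IndexedRecord.get() composed with Encoder.writeInt().
    Access<int[], Integer> recordGet = r -> r[0];
    Flow<Integer, StringBuilder> encodeInt = (i, enc) -> enc.append(i);
    Flow<int[], StringBuilder> write = compose(recordGet, encodeInt);

    // "Read" direction: Decoder.readInt() composed with IndexedRecord.put().
    Access<Iterator<Integer>, Integer> decodeInt = Iterator::next;
    Flow<Integer, int[]> recordPut = (i, rec) -> { rec[0] = i; return rec; };
    Flow<Iterator<Integer>, int[]> read = compose(decodeInt, recordPut);

    System.out.println(write.flow(new int[]{9}, new StringBuilder()));    // 9
    System.out.println(read.flow(List.of(9).iterator(), new int[1])[0]); // 9
  }
}
```

Neither direction needed a monolithic reader or writer class; each is just a source-side op composed with a target-side op.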

> Java: Data Flow Overhaul -- Composition and Symmetry
> ----------------------------------------------------
>
>                 Key: AVRO-859
>                 URL: https://issues.apache.org/jira/browse/AVRO-859
>             Project: Avro
>          Issue Type: New Feature
>          Components: java
>            Reporter: Scott Carey
>            Assignee: Scott Carey
>
> Data flow in Avro is currently broken into two parts:  Read and Write.  These share many
> common patterns but almost no common code.
> Additionally, the APIs for this are DatumReader and DatumWriter, which require that
> implementations know how to traverse Schemas and use the Resolver.
> This is a proposal to overhaul the inner workings of Avro Java between the Decoder/Encoder
> APIs and DatumReader/DatumWriter such that there is significantly more code re-use and much
> greater opportunity for new features that can all share in general optimizations and dynamic
> code generation.
> The two primary concepts involved are:
> * _*Functional Composition*_
> * _*Symmetry*_
> h4. Functional Composition
> All read and write operations can be broken into functional bits and composed rather
> than writing monolithic classes.  This allows a "DatumWriter2" to be a graph of functions
> that pre-compute all state required from a schema rather than traverse a schema for each write.
> h4. Symmetry
> Avro's data flow can be made symmetric.  Rather than thinking in terms of Read and Write,
> think in terms of:
> * _*Source*_: Where data that is represented by an Avro schema comes from -- this may
> be a Decoder, or an Object graph.
> * _*Target*_: Where data that represents an Avro schema is sent -- this may be an Encoder
> or an Object graph.
> (More detail in the comments)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
