avro-user mailing list archives

From Scott Carey <scottca...@apache.org>
Subject Re: Reader / Writer terminology
Date Sat, 08 Jun 2013 21:51:49 GMT
I'm about to make all of this even more confusing...

For pair-wise resolution when the operation is deserialization, "reader" and
"writer" make sense.  In a more general sense it is simply "from" and "to"
-- One might move from schema A to B without serialization at all,
transforming a data structure, or simply want a view of data in the form of
A as if it was in B.   There aren't any clear naming winners and many sound
good for one use case but worse for others:  'source' and 'destination',
'source' and 'sink', 'original' and 'target', 'expected' and 'actual',
'reader' and 'writer', 'resolver' and 'resolvee', 'sender' and 'receiver'.

As part of AVRO-1124 I have recently met in person with a few folks who
needed enhancements to that ticket (the discussion and conclusion will be
added there shortly, prior to the next patch version).
The result is that two names are not enough, because expressing resolution
of _sets_ of schemas is more complicated than pairs.

When describing a set of schemas that represent some sort of data that may
have been persisted,  six states are needed.   The six states are made up of
two dimensions.   
* The "read" dimension is binary, and represents whether a schema is ever
used for reading (is ever a "to", "reader", or "target").
* The "write" dimension has three states:  Writer (an active "from" or
"source"), Written (persisted data, not actively written), and None (not
used for writing).
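
As a rough illustration only (this is not code from the AVRO-1124 patch, and
the SchemaState class is a made-up helper), the two dimensions could be
modeled as two Java enums whose combinations give the six states:

```java
import java.util.ArrayList;
import java.util.List;

// Enum names follow the ones proposed in this message; they are not part of
// any released Avro API.
enum ReadState { READER, NONE }

enum WriteState { WRITER, WRITTEN, NONE }

// Hypothetical helper: one point in the two-dimensional state space.
final class SchemaState {
    final ReadState read;
    final WriteState write;

    SchemaState(ReadState read, WriteState write) {
        this.read = read;
        this.write = write;
    }

    // Every combination of the two dimensions: 2 x 3 = 6 states.
    static List<SchemaState> all() {
        List<SchemaState> states = new ArrayList<>();
        for (ReadState r : ReadState.values()) {
            for (WriteState w : WriteState.values()) {
                states.add(new SchemaState(r, w));
            }
        }
        return states;
    }

    @Override
    public String toString() {
        return read + "/" + write;
    }
}
```

Calling SchemaState.all() lists the combinations in declaration order, from
READER/WRITER through NONE/NONE.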

Naming these will be confusing; as part of AVRO-1124 we'll have to
have names that are as clear as possible.  Currently I have enumerations:
ReadState.READER and ReadState.NONE;  WriteState.WRITER, WriteState.WRITTEN,
and WriteState.NONE.   I am not a big fan of these names, and am open to
suggestions.   A consistent approach in naming is important.   For example,
I previously had WriteState.WRITTEN named WriteState.READABLE.  That name
best captures what the state is for, but is extremely confusing.

These six states are related by one schema resolution rule:
Schemas in state ReadState.READER must be able to read all schemas with
WriteState.WRITER or WriteState.WRITTEN.

and one rule for persisting data:
Data must not be persisted unless the corresponding schema is in state
WriteState.WRITER.

Without going into the details, this allows for any schema evolution use
case over a set of schemas with both ephemeral data and persisted data.
Schemas can transition from one state to another, as long as the constraint
rules above are met at all times.
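
To make the two rules concrete, here is a minimal sketch under stated
assumptions: SchemaSet and its methods are hypothetical names, schemas are
identified by plain strings, and the actual compatibility check is passed in
as a predicate rather than implemented.  It is not the AVRO-1124 design, just
one way the constraints could be enforced on state transitions:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiPredicate;

enum ReadState { READER, NONE }

enum WriteState { WRITER, WRITTEN, NONE }

// Hypothetical container: a set of named schemas, each with a read state
// and a write state.
final class SchemaSet {
    private final Map<String, ReadState> readStates = new LinkedHashMap<>();
    private final Map<String, WriteState> writeStates = new LinkedHashMap<>();
    // canRead.test(reader, writer) stands in for real schema resolution.
    private final BiPredicate<String, String> canRead;

    SchemaSet(BiPredicate<String, String> canRead) {
        this.canRead = canRead;
    }

    void put(String name, ReadState read, WriteState write) {
        readStates.put(name, read);
        writeStates.put(name, write);
    }

    // Resolution rule: every READER must be able to read every schema
    // that is WRITER or WRITTEN.
    boolean isConsistent() {
        for (Map.Entry<String, ReadState> r : readStates.entrySet()) {
            if (r.getValue() != ReadState.READER) continue;
            for (Map.Entry<String, WriteState> w : writeStates.entrySet()) {
                if (w.getValue() == WriteState.NONE) continue;
                if (!canRead.test(r.getKey(), w.getKey())) return false;
            }
        }
        return true;
    }

    // Persistence rule: data may be persisted only under a WRITER schema.
    boolean mayPersist(String name) {
        return writeStates.get(name) == WriteState.WRITER;
    }

    // Apply a transition only if the resolution rule still holds afterwards;
    // otherwise restore the previous state and report failure.
    boolean transition(String name, ReadState read, WriteState write) {
        if (!readStates.containsKey(name)) {
            throw new IllegalArgumentException("unknown schema: " + name);
        }
        ReadState oldRead = readStates.get(name);
        WriteState oldWrite = writeStates.get(name);
        put(name, read, write);
        if (isConsistent()) return true;
        put(name, oldRead, oldWrite);
        return false;
    }
}
```

For example, with an old schema in NONE/WRITTEN and a new one in
READER/WRITER, promoting the old schema to READER is rejected if it cannot
read data written with the new one.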


"Reader" and "Writer" have been useful because they correlate with other
meaningful names well -- hypothetically:
   boolean mySchema.canRead(Schema writer) and
   boolean mySchema.canBeReadWith(Schema reader)
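
A toy sketch of how those two hypothetical methods mirror each other.
ToySchema is invented for illustration and is not Avro's Schema class; it
only models one simplified resolution rule (a reader field that the writer
lacks must have a default), whereas real Avro resolution covers much more:

```java
import java.util.Map;

// Toy stand-in for a record schema: maps each field name to a flag saying
// whether that field declares a default value.
final class ToySchema {
    private final Map<String, Boolean> fieldHasDefault;

    ToySchema(Map<String, Boolean> fieldHasDefault) {
        this.fieldHasDefault = fieldHasDefault;
    }

    // True if this schema, acting as the reader, can resolve data written
    // with `writer`.  Simplified rule: each reader field must either exist
    // in the writer or have a default.
    boolean canRead(ToySchema writer) {
        for (Map.Entry<String, Boolean> field : fieldHasDefault.entrySet()) {
            boolean inWriter = writer.fieldHasDefault.containsKey(field.getKey());
            if (!inWriter && !field.getValue()) return false;
        }
        return true;
    }

    // The same relationship seen from the writer's side.
    boolean canBeReadWith(ToySchema reader) {
        return reader.canRead(this);
    }
}
```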

A naming scheme for describing schema resolution and evolution will need to
work across many use cases and be useful for describing relationships
between schemas.  Describing only the pair-wise resolution is not enough.

On 6/8/13 12:44 AM, "Doug Cutting" <cutting@apache.org> wrote:

> Originally I used the term 'actual' for the schema of the data written and
> 'expected' for the schema that the reader of the data wished to see it as.
> Some found those terms confusing and suggested that 'writer' and 'reader' were
> more intuitive, so we started using those instead. That unfortunately seems
> not to have resolved the confusion entirely.
> 
> Perhaps we should improve the documentation around this? Do you have any
> specific suggestions about how that might be done?
> 
> Doug
> 
> On Jun 7, 2013 10:12 PM, "Gregory (Grisha) Trubetskoy" <grisha@apache.org>
> wrote:
>> 
>> I'm curious how the "Reader" and "Writer" terminology came about, and, most
>> importantly, whether it's as confusing to the rest of you as it is to me?
>> 
>> As I understand it, the principal analogy here is from the RPC world - a
>> process A writes some Avro to process B, in which case A is the writer and B
>> is the reader.
>> 
>> And there is the possibility that the schema which B may be expecting isn't
>> what A is providing, thus B may have to do some conversion on its end to grok
>> it, and Avro schema resolution rules may make this possible.
>> 
>> So far so good. This is where it becomes confusing. I am lost on how the act
>> of reading or writing is relevant to the task at hand, which is conversion of
>> a value from one schema to another.
>> 
>> As I read stuff on the lists and the docs, I couldn't help noticing words
>> such as "original", "first", "second", "actual", "expected" being used
>> alongside "reader" and "writer" as clarification.
>> 
>> What would be wrong with "source" and "destination" schemas?
>> 
>> Consider the following line (from Avro-C):
>> 
>>     writer_iface = avro_resolved_writer_new(writer_schema, reader_schema);
>> 
>> Here "writer" in resolved_writer and writer_schema are unrelated. The former
>> refers to the fact that this interface will be modifying (writing to) an
>> object, the latter is referring to the writer (source, original, a.k.a
>> actual) schema.
>> 
>> Wouldn't this read better as:
>> 
>>     writer_iface = avro_resolved_writer_new(source_schema, dest_schema);
>> 
>> Anyway - I just want to know if I'm missing something obvious when I think
>> that reader/writer is confusing.
>> 
>> Thanks,
>> 
>> Grisha


