avro-dev mailing list archives

From "Philip Zeyliger (JIRA)" <j...@apache.org>
Subject [jira] Commented: (AVRO-160) file format should be friendly to streaming
Date Thu, 22 Oct 2009 18:27:59 GMT

    [ https://issues.apache.org/jira/browse/AVRO-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768806#action_12768806 ]

Philip Zeyliger commented on AVRO-160:
--------------------------------------

Ok, that makes sense.

For some reason, I thought you could write AAAAAXBBBBBBY where records A are written with
schema X, and then records B are written with schema Y, where X and Y are resolvable using
schema resolution.  But that doesn't work: even though X and Y may be resolvable, they may
not share the same serialization.
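
To make the serialization point concrete, here is a minimal sketch in Python that hand-encodes
one value under two schemas.  It assumes X declares the value as an int and Y as a double (a
hypothetical pair, chosen because Avro's resolution rules promote int to double).  The helper
names are made up for illustration, and the sketch does not use the Avro library.

```python
import struct

def encode_int(n: int) -> bytes:
    """Avro int/long encoding: zigzag, then variable-length base-128."""
    z = (n << 1) ^ (n >> 63)  # zigzag maps small magnitudes to small unsigned values
    out = bytearray()
    while True:
        low7 = z & 0x7F
        z >>= 7
        if z:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)

def encode_double(x: float) -> bytes:
    """Avro double encoding: 8 bytes, little-endian IEEE 754."""
    return struct.pack("<d", x)

def decode_double(buf: bytes) -> float:
    return struct.unpack("<d", buf)[0]

written_with_x = encode_int(3)       # a record "A", assumed schema X (int)
written_with_y = encode_double(3.0)  # a record "B", assumed schema Y (double)

# Same logical value, different bytes and different lengths.
print(len(written_with_x), len(written_with_y))

# Reading X's bytes with only Y in hand fails, although X and Y are resolvable.
try:
    decode_double(written_with_x)
    misread = False
except struct.error:
    misread = True
```

With both schemas available, a reader would decode the bytes as an int and then promote the
result; with only Y, it has no way to know the bytes are not a double.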

So, it turns out there are two types of schema compatibility: writer-reader compatibility,
which means that we can read when we have both schemas available, and writer-writer compatibility,
which concerns whether we can read (or write) data with only one of the two schemas.  I don't
like those names, though.

There's something appealing about writing the schema frequently.  You could also store an
offset pointer to the schema in every block header, instead of the entire thing.
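
A rough sketch of the offset-pointer idea, in Python.  The layout here is entirely hypothetical
(a length-prefixed schema, and a block header holding an 8-byte schema offset plus a 4-byte
payload length); it is not the Avro file format, and the function names are invented for
illustration.

```python
import io
import struct

def write_schema(f, schema_json: str) -> int:
    """Write a length-prefixed schema; return the offset where it starts."""
    offset = f.tell()
    data = schema_json.encode("utf-8")
    f.write(struct.pack("<I", len(data)))
    f.write(data)
    return offset

def write_block(f, schema_offset: int, payload: bytes) -> int:
    """Write a block whose header is (schema offset, payload length)."""
    offset = f.tell()
    f.write(struct.pack("<QI", schema_offset, len(payload)))
    f.write(payload)
    return offset

def read_block(f, block_offset: int):
    """Return (schema_json, payload) for the block at block_offset."""
    f.seek(block_offset)
    schema_offset, size = struct.unpack("<QI", f.read(12))
    payload = f.read(size)
    f.seek(schema_offset)  # follow the pointer back to the schema
    (schema_len,) = struct.unpack("<I", f.read(4))
    return f.read(schema_len).decode("utf-8"), payload

f = io.BytesIO()
x = write_schema(f, '"int"')
block_a = write_block(f, x, b"\x06")
y = write_schema(f, '"double"')
block_b = write_block(f, y, struct.pack("<d", 3.0))

schema_a, payload_a = read_block(f, block_a)
schema_b, payload_b = read_block(f, block_b)
```

Following the pointer requires a backward seek, so a purely streaming reader would instead
have to remember each schema as it passes it.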

What use cases are you thinking about?
 * Map/reduce outputs tend to be uniform, since it's unlikely that a M/R program changes its
output in medias res.
 * Map/reduce inputs might be heterogeneous because you're combining logs from last year with
logs from this year, though it's likely that individual files are homogeneous.  (And if you
bother to combine files, you may as well do the schema resolution as part of the concatenation,
and keep the new file homogeneous.)
 * HBase cells are not likely to use this format, but rather keep the schema per column.
 * An individual program's log files are likely to be homogeneous.  There's no harm in starting
a new log file when you upgrade, rather than appending to the old one.

> file format should be friendly to streaming
> -------------------------------------------
>
>                 Key: AVRO-160
>                 URL: https://issues.apache.org/jira/browse/AVRO-160
>             Project: Avro
>          Issue Type: Improvement
>          Components: spec
>            Reporter: Doug Cutting
>
> It should be possible to stream through an Avro data file without seeking to the end.
> Currently the interpretation is that schemas written to the file apply to all entries
> before them.  If this were changed so that they instead apply to all entries that follow,
> and the initial schema is written at the start of the file, then streaming could be supported.
> Note that the only change permitted to a schema as a file is written is, if it is
> a union, to add new branches at the end of that union.  If it is not a union, no changes may
> be made.  So it is still the case that the final schema in a file can read every entry in
> the file and thus may be used to randomly access the file.
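
One way to see why appending union branches is safe, sketched in Python: Avro writes a union
value as the zero-based index of its branch followed by the value, so branches added at the
end leave every earlier index unchanged.  The schemas and helper names below are hypothetical
examples, and the encoder is written by hand rather than taken from the Avro library.

```python
def encode_long(n: int) -> bytes:
    """Avro int/long encoding: zigzag, then variable-length base-128."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while True:
        low7 = z & 0x7F
        z >>= 7
        if z:
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)

def decode_long(buf: bytes, pos: int = 0):
    """Inverse of encode_long; returns (value, next position)."""
    shift = 0
    z = 0
    while True:
        b = buf[pos]
        pos += 1
        z |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1), pos

def write_union(branches, branch: str, value_bytes: bytes) -> bytes:
    """Union encoding: branch index as a long, then the encoded value."""
    return encode_long(branches.index(branch)) + value_bytes

def read_branch(branches, buf: bytes) -> str:
    """Name the branch that a reader using `branches` would pick for buf."""
    index, _ = decode_long(buf)
    return branches[index]

initial = ["null", "string"]             # schema at the start of the file
final = ["null", "string", "long"]       # branch appended at the end
reordered = ["long", "null", "string"]   # not permitted: indices shift

# An early entry: the string "hi" (length prefix, then UTF-8 bytes).
early = write_union(initial, "string", encode_long(2) + b"hi")

with_final = read_branch(final, early)          # still resolves to the string branch
with_reordered = read_branch(reordered, early)  # resolves to the wrong branch
```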

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

