avro-dev mailing list archives

From "Doug Cutting (JIRA)" <j...@apache.org>
Subject [jira] Commented: (AVRO-160) file format should be friendly to streaming
Date Mon, 14 Dec 2009 20:05:21 GMT

    [ https://issues.apache.org/jira/browse/AVRO-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790320#action_12790320 ]

Doug Cutting commented on AVRO-160:

I'm now having second thoughts about the current proposal to include the schema with each
block.  We're going through a lot of work to support changing the schema within a
file, yet I don't believe that to be a common usage.  I wonder if instead we should
simply make the schema part of the file header and not permit it to be modified while writing.
This would support MapReduce well.  If someone wishes to modify or intermix schemas, then
they have to copy their data to a new file, using a new schema.

So, my new, reductionist approach is that a data file has just:
 - a header with
  -- a magic number identifying this file format (incremented from current data file)
  -- a sync marker
  -- a JSON-format schema
  -- a compression codec name (default is null)
  -- an Avro encoding name (text/binary, default is binary)
  -- optionally other, user-provided metadata
 - followed by a sequence of blocks, each with:
  -- the sync marker from the header
  -- the count of instances in this block
  -- the length in bytes of this compressed block
  -- the compressed block data
   --- a sequence of 'count' entries corresponding to the header's schema
That's it.  Thoughts?
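
For illustration only, the layout proposed above might be sketched in Python as follows. This is not the Avro implementation or its wire format: the magic value `DEMO`, the 16-byte sync marker, the fixed-width big-endian integers, the single JSON blob holding schema, codec, encoding and metadata, and the helper names `write_header`, `write_block`, `read_header` and `read_blocks` are all assumptions made for this sketch. Entries use a JSON-lines "text" encoding, since Avro's binary encoding is out of scope here.

```python
import io
import json
import os
import struct
import zlib

MAGIC = b"DEMO"   # placeholder, NOT the real Avro magic number
SYNC_SIZE = 16    # assumed sync marker length

# Codec name -> (compress, decompress); "null" means no compression.
CODECS = {
    "null": (lambda b: b, lambda b: b),
    "deflate": (zlib.compress, zlib.decompress),
}


def write_header(out, schema, codec="null", meta=None):
    """Write magic, sync marker, then one JSON blob with the remaining header fields."""
    sync = os.urandom(SYNC_SIZE)
    header = {"schema": schema, "codec": codec, "encoding": "text", "meta": meta or {}}
    blob = json.dumps(header).encode("utf-8")
    out.write(MAGIC + sync + struct.pack(">I", len(blob)) + blob)
    return sync


def write_block(out, sync, codec, entries):
    """Write sync marker, entry count, compressed length, compressed data."""
    compress, _ = CODECS[codec]
    raw = "\n".join(json.dumps(e) for e in entries).encode("utf-8")
    data = compress(raw)
    out.write(sync + struct.pack(">II", len(entries), len(data)) + data)


def read_exact(src, n):
    buf = src.read(n)
    if len(buf) != n:
        raise EOFError("truncated file")
    return buf


def read_header(src):
    """Read the header from the front of the stream; return (header, sync)."""
    if read_exact(src, len(MAGIC)) != MAGIC:
        raise ValueError("bad magic number")
    sync = read_exact(src, SYNC_SIZE)
    (size,) = struct.unpack(">I", read_exact(src, 4))
    return json.loads(read_exact(src, size)), sync


def read_blocks(src, header, sync):
    """Yield entries block by block, reading forward only (no seeking)."""
    _, decompress = CODECS[header["codec"]]
    while True:
        marker = src.read(SYNC_SIZE)
        if not marker:
            return  # clean end of file
        if marker != sync:
            raise ValueError("sync marker mismatch")
        count, length = struct.unpack(">II", read_exact(src, 8))
        raw = decompress(read_exact(src, length)).decode("utf-8")
        lines = raw.split("\n") if count else []
        if len(lines) != count:
            raise ValueError("entry count mismatch")
        for line in lines:
            yield json.loads(line)
```

Because the schema and codec sit in the header and every block carries its own count and length, a reader in this sketch never needs to seek to the end of the file, and it can check the sync marker at each block boundary.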


> file format should be friendly to streaming
> -------------------------------------------
>                 Key: AVRO-160
>                 URL: https://issues.apache.org/jira/browse/AVRO-160
>             Project: Avro
>          Issue Type: Improvement
>          Components: spec
>            Reporter: Doug Cutting
>            Assignee: Doug Cutting
> It should be possible to stream through an Avro data file without seeking to the end.
> Currently the interpretation is that schemas written to the file apply to all entries
> before them.  If this were changed so that they instead apply to all entries that follow,
> and the initial schema is written at the start of the file, then streaming could be supported.
> Note that the only change permitted to a schema as a file is written is, if it is
> a union, to add new branches at the end of that union.  If it is not a union, no changes may
> be made.  So it is still the case that the final schema in a file can read every entry in
> the file and thus may be used to randomly access the file.
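
As a rough sketch of the schema-change rule in the quoted description, the check below takes schemas as already-parsed JSON values, where a union is a JSON array. The function name `is_permitted_change` is hypothetical, and comparing branches by plain structural equality is an assumption of this sketch, not something the issue specifies.

```python
def is_permitted_change(old_schema, new_schema):
    """Return True if new_schema equals old_schema, or old_schema is a union
    and new_schema only appends branches at the end of that union."""
    if isinstance(old_schema, list):  # a JSON array denotes a union
        return (isinstance(new_schema, list)
                and new_schema[:len(old_schema)] == old_schema)
    # Not a union: no changes may be made.
    return new_schema == old_schema
```

Under this rule every earlier schema is a prefix of the final union, which is why the final schema can read every entry in the file.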

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
