avro-dev mailing list archives

From "Scott Carey (JIRA)" <j...@apache.org>
Subject [jira] Commented: (AVRO-160) file format should be friendly to streaming
Date Thu, 22 Oct 2009 23:45:59 GMT

    [ https://issues.apache.org/jira/browse/AVRO-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768975#action_12768975 ]
Scott Carey commented on AVRO-160:

{quote}For mapreduce, we need to be able to seek to an arbitrary point in the file, then scan
to the next sync point and start reading the file. That's mostly what I mean by random access.{quote}

Ok, I misinterpreted.  I'll call that "seek and scan" for the rest of this comment, as opposed
to random access, which I interpret as "go to tuple # 655321" or "read the first tuple following
location X".  It is also related to the limitation that all schemas in the file must be representable
in one big union schema.  If the only requirement for reading a tuple is that the reader knows
the schema in the prior metadata block, then what can be stored in one file is less restrictive.
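
The "seek and scan" idea can be sketched quickly. Below is a toy Python sketch, not Avro's actual format: the sync marker here is a fixed placeholder (a real file would record a random per-file 16-byte marker in its header), and `seek_and_scan` is a hypothetical name.

```python
import io

# Placeholder 16-byte sync marker; a real file would use a random
# per-file marker recorded in the file header.
SYNC = b"SYNCMARK-16bytes"

def seek_and_scan(f, offset):
    """Seek to an arbitrary byte offset, then scan forward for the next
    sync marker; return the position just past it, where whole-record
    reading can resume (None if no marker follows).

    For brevity this reads the remainder of the file in one call; a
    real reader would scan in chunks and handle a marker that straddles
    a chunk boundary."""
    f.seek(offset)
    rest = f.read()
    i = rest.find(SYNC)
    return None if i < 0 else offset + i + len(SYNC)
```

A MapReduce task handed an arbitrary split boundary would call this once to land on the first whole block at or after its starting offset.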

{quote}It should also be possible to layer indexes on top of this, to support random access by key.
Indexes might be stored as side files, or perhaps in the file's metadata. To support these,
it should be possible to ask, while writing, the position of the current block start, so that
one may store that in an index and subsequently seek to it, then scan the block for the desired
entry.{quote}

I agree.  It is useful to leave open the option for index-type metadata in the metadata block.
I'll add that the metadata block might also contain an index into that block, to avoid scanning
it (for large blocks).  Unfortunately, to do this with streaming writes, the metadata block
with the index must come _after_ the block.  So perhaps the metadata block needs two types
of metadata: one describing the previous block(s), and one describing the block that follows?
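
The "ask for the current block start while writing" idea amounts to something like the following toy sketch. It assumes a trivial length-prefixed block encoding, not Avro's actual layout, and the names (`BlockWriter`, `read_block`) are hypothetical.

```python
import io

class BlockWriter:
    """Toy writer for length-prefixed blocks that exposes each block's
    start offset, so a caller can build a key -> offset index (the
    'side file' or index-metadata idea). Not Avro's actual layout."""

    def __init__(self, f):
        self.f = f
        self.index = {}  # first key in block -> byte offset of block start

    def write_block(self, first_key, payload):
        self.index[first_key] = self.f.tell()  # position asked for while writing
        self.f.write(len(payload).to_bytes(4, "big"))
        self.f.write(payload)

def read_block(f, offset):
    """Random access by key: seek straight to an offset recorded in the
    index, then read (and, for large blocks, scan) that one block."""
    f.seek(offset)
    n = int.from_bytes(f.read(4), "big")
    return f.read(n)
```

The point is only that the writer must expose `tell()`-style positions at block boundaries; where the resulting index lives (side file or metadata block) is a separate choice.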

This is where I start to wonder if serving too many needs in one file type is the right choice.

Let me elaborate on my last proposal. 

{quote}I like it, but if we ever want true optimized random access (perhaps not) it would have to
change or we would need side files.

I think it still may make sense to flush metadata at the end of the file. It may no longer
contain the schema, but it can contain things like counts and indexes. Streaming applications
would not be able to use this, but other applications might find it very useful. Side files
in HDFS are expensive.{quote}

It definitely makes sense to flush some metadata at the end, but much of that might be optional.

One useful thing would be the following.  It allows MapReduce to avoid having to "seek and
scan" and instead find the start of the metadata block nearest the HDFS block boundary. If
counts are stored, it also allows basic random access by tuple number.

When a file is closed, the last metadata block can contain the offset of each known metadata
block.  Perhaps this is optional, but if it exists then the input splitter can split on those
boundaries and avoid seeking.  When the file is appended to, it can either copy this crude
index forward or keep a reference to the prior "finish" metadata block.
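
A minimal sketch of that crude trailing index, under assumed (hypothetical) fixed-width encodings: 8-byte offsets followed by a 4-byte count as the file's final bytes, plus the splitter logic that picks the recorded block offset nearest an HDFS block boundary.

```python
import bisect
import io
import os

def write_index_footer(f, block_offsets):
    """Append the 'crude index': an 8-byte offset per known block,
    followed by a 4-byte block count, as the file's final metadata."""
    for off in block_offsets:
        f.write(off.to_bytes(8, "big"))
    f.write(len(block_offsets).to_bytes(4, "big"))

def read_index_footer(f):
    """Read the trailing count, then the offsets that precede it."""
    f.seek(-4, os.SEEK_END)
    n = int.from_bytes(f.read(4), "big")
    f.seek(-(4 + 8 * n), os.SEEK_END)
    return [int.from_bytes(f.read(8), "big") for _ in range(n)]

def nearest_split(offsets, boundary):
    """Pick the recorded block offset closest to a desired split point
    (e.g. an HDFS block boundary), avoiding seek-and-scan entirely."""
    i = bisect.bisect_left(offsets, boundary)
    candidates = offsets[max(0, i - 1):i + 1]
    return min(candidates, key=lambda o: abs(o - boundary))
```

Appending would then mean truncating or superseding this footer and writing a new one, which is the copy-forward step described above.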

Maybe a straightforward thing to do is to consider that each block in this file has a header,
a data block, and a footer.  The header has the schema of the tuples in the block and any
other information required to read the block, such as the compression codec.  The footer
contains the tuple count, other optional info (like an index), and the length of the block.
The sync marker is in every footer, and in the first block's header.
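
The header/data/footer layout above can be sketched like so. This is a toy encoding with hypothetical field widths, not a proposal for the actual wire format; the real header would also carry the codec and other read-time information.

```python
import io

SYNC = b"SYNCMARK-16bytes"  # stands in for a per-file random marker

def write_block(f, schema_json, records):
    """Write one self-describing block: header (the schema), data (the
    records), footer (tuple count, total block length, sync marker)."""
    start = f.tell()
    header = schema_json.encode("utf-8")
    f.write(len(header).to_bytes(4, "big"))
    f.write(header)
    data = b"".join(records)
    f.write(len(data).to_bytes(4, "big"))
    f.write(data)
    f.write(len(records).to_bytes(4, "big"))  # tuple count
    # Total block length, including the 8-byte length field and the
    # sync marker still to be written, so a reader positioned at the
    # footer can step back to the block start without a forward scan.
    f.write((f.tell() + 8 + len(SYNC) - start).to_bytes(8, "big"))
    f.write(SYNC)
```

Because the count and length live in the footer, the writer never needs to buffer or back-patch, which is what keeps this layout streaming-friendly.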

Ok, I think I'm done with my speculation for now :)

> file format should be friendly to streaming
> -------------------------------------------
>                 Key: AVRO-160
>                 URL: https://issues.apache.org/jira/browse/AVRO-160
>             Project: Avro
>          Issue Type: Improvement
>          Components: spec
>            Reporter: Doug Cutting
> It should be possible to stream through an Avro data file without seeking to the end.
> Currently the interpretation is that schemas written to the file apply to all entries
> before them.  If this were changed so that they instead apply to all entries that follow,
> and the initial schema is written at the start of the file, then streaming could be supported.
> Note that the only change permitted to a schema as a file is written is, if it is a
> union, to add new branches at the end of that union.  If it is not a union, no changes may
> be made.  So it is still the case that the final schema in a file can read every entry in
> the file and thus may be used to randomly access the file.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
