avro-dev mailing list archives

From "Doug Cutting (JIRA)" <j...@apache.org>
Subject [jira] Commented: (AVRO-160) file format should be friendly to streaming
Date Sat, 02 Jan 2010 16:58:54 GMT

    [ https://issues.apache.org/jira/browse/AVRO-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795870#action_12795870 ]

Doug Cutting commented on AVRO-160:
-----------------------------------

> you have removed the count of the number of objects, but you have kept the count of the
> number of bytes, correct? That's what the new spec says

No, the spec currently says each block is prefixed by "a long indicating the count of objects
in this block".  This is as it was before, without a byte count.  Byte counts are left to
codec implementations on an as-needed basis.
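
To make the "count of objects, no byte count" layout concrete, here is a minimal sketch in Python. It assumes a file whose schema is simply "long", so every object is one zig-zag varint, which is how Avro encodes longs. The helper names (encode_long, decode_long, write_block, read_block) are hypothetical and are not taken from the patch.

```python
import io


def encode_long(n):
    """Encode an integer as a zig-zag varint, the Avro binary encoding of a long."""
    z = (n << 1) ^ (n >> 63)  # zig-zag: small magnitudes become small unsigned values
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def decode_long(stream):
    """Read one zig-zag varint from a binary stream."""
    z = 0
    shift = 0
    while True:
        b = stream.read(1)[0]
        z |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1)


def write_block(values):
    """Prefix the block with the object count only; no byte count is written."""
    return encode_long(len(values)) + b"".join(encode_long(v) for v in values)


def read_block(stream):
    """Without a byte count, the reader must decode each object to find the block's end."""
    count = decode_long(stream)
    return [decode_long(stream) for _ in range(count)]


block = write_block([3, -1, 64])
print(read_block(io.BytesIO(block)))
```

The point of the sketch is the trade-off: with only an object count, a reader cannot skip a block without decoding it, which is why a codec that needs to skip or buffer whole blocks would add its own byte count.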

> Also, you use the term "split" in the Java code but do not use it in the spec.

I use that term in the unit test.  The term is borrowed from Hadoop MapReduce, where it refers
to dividing a file at arbitrary points among tasks.  This is an important use case for the
Java data file implementation.  It requires nothing more in the spec than periodic sync markers.
Probably only the Java implementation needs to implement methods like DataFileReader#sync()
and DataFileReader#pastSync(), since Hadoop MapReduce is in Java.
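
As a rough illustration of why periodic sync markers are enough for splitting, the sketch below is in Python rather than Java and does not use the Avro API. It assumes a simplified layout in which a 16-byte marker precedes every block and closes the file, and in which block payloads never contain the marker. The names SYNC and read_split are hypothetical. Each task scans forward from its split's start to the next marker, then reads blocks until it reaches a marker at or past its split's end.

```python
SYNC = bytes(range(16))  # stand-in marker; an assumption of this sketch


def read_split(data, start, end, marker=SYNC):
    """Return the blocks whose preceding marker begins inside [start, end)."""
    blocks = []
    pos = data.find(marker, start)  # skip ahead to the first marker in the split
    while pos != -1 and pos < end:  # stop once the marker lies past the split
        block_start = pos + len(marker)
        nxt = data.find(marker, block_start)
        if nxt == -1:
            break  # no closing marker, so no complete block remains
        blocks.append(data[block_start:nxt])
        pos = nxt
    return blocks


# Three blocks, each preceded by a marker, with a closing marker at the end.
data = SYNC + b"aaa" + SYNC + b"bb" + SYNC + b"cccc" + SYNC

# Split boundaries are arbitrary byte offsets, as they would be in MapReduce.
for start, end in [(0, 30), (30, 60), (60, len(data))]:
    print((start, end), read_split(data, start, end))
```

Because every block is claimed by exactly one split (the one containing the start of its preceding marker), the tasks together read each block once, no matter where the split boundaries fall.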


> file format should be friendly to streaming
> -------------------------------------------
>
>                 Key: AVRO-160
>                 URL: https://issues.apache.org/jira/browse/AVRO-160
>             Project: Avro
>          Issue Type: Improvement
>          Components: spec
>            Reporter: Doug Cutting
>            Assignee: Doug Cutting
>             Fix For: 1.3.0
>
>         Attachments: AVRO-160-python.patch, AVRO-160.patch, AVRO-160.patch, AVRO-160.patch,
>                      AVRO-160.patch
>
>
> It should be possible to stream through an Avro data file without seeking to the end.
> Currently the interpretation is that schemas written to the file apply to all entries
> before them.  If this were changed so that they instead apply to all entries that follow,
> and the initial schema is written at the start of the file, then streaming could be supported.
> Note that the only change permitted to a schema as a file is written is, if it is
> a union, to add new branches at the end of that union.  If it is not a union, no changes may
> be made.  So it is still the case that the final schema in a file can read every entry in
> the file and thus may be used to randomly access the file.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

