avro-dev mailing list archives

From "Douglas Creager (Updated) (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (AVRO-986) Avro files generated from avro-c dont work with the Java mapred implementation.
Date Tue, 27 Dec 2011 15:06:30 GMT

     [ https://issues.apache.org/jira/browse/AVRO-986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Douglas Creager updated AVRO-986:

    Attachment: 0001-avromod-utility.patch

Here's a patch that adds a new "avromod" command-line utility.  It can be used as a fixup
script to remove the avro.sync field from the header (once I commit Michael's patch).  It's
also useful in its own right since you can create copies of Avro files with different compression
codecs and block sizes.  Eventually, we can also add options for changing the schema of the
data in the file.
> Avro files generated from avro-c dont work with the Java mapred implementation.
> -------------------------------------------------------------------------------
>                 Key: AVRO-986
>                 URL: https://issues.apache.org/jira/browse/AVRO-986
>             Project: Avro
>          Issue Type: Bug
>          Components: c, java
>         Environment: avro-c 1.6.2-SNAPSHOT
> avro-java 1.6.2-SNAPSHOT
> hadoop 0.20.2
>            Reporter: Michael Cooper
>            Priority: Critical
>              Labels: c, hadoop, java, mapreduce
>         Attachments: 0001-Remove-sync-marker-from-metadata-in-header.patch, 0001-avromod-utility.patch, AVRO-986-java.patch, quickstop.db
> When a file generated from the Avro-C implementation is fed into Hadoop, it will fail with "Block size invalid or too large for this implementation: -49".
> This is caused by the sync marker, namely the one that Avro-C puts into the header...
> The org.apache.avro.mapred.AvroRecordReader uses a FileSplit object to work out where it should read from, but this class is not particularly smart: it just divides the file up into equal-size chunks, the first starting at position 0.
> So org.apache.avro.mapred.AvroRecordReader gets 0 as the start of its chunk, and calls
> {code:title=AvroRecordReader.java}reader.sync(split.getStart());   // sync to start{code}
> Then org.apache.avro.file.DataFileReader::seek() goes to 0, and the reader searches for a sync marker.
> It encounters one at position 32, the one stored in the header metadata map under "avro.sync".
> No other implementations add the sync marker in the metadata map, and none read it from there, not even the C version.
> I suggest we remove this from the header as the simplest solution.
> Another solution would be to create an AvroFileSplit class in mapred that knows where the blocks are, and provides the correct locations in the first place.
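
The failure described above can be reproduced without Avro or Hadoop. The following is a minimal sketch, not the actual DataFileReader code: the class and helper names (SyncScanSketch, buildHeader, scanForSync) are hypothetical, and the header layout (an "avro.sync" entry written first, immediately followed by an "avro.schema" entry) is an assumption chosen because it is consistent with the "position 32" and "-49" values in the report. It builds such a header, scans from offset 0 for the 16-byte sync marker, and then decodes the next two zigzag varint longs the way a reader would decode a block's record count and byte size.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class SyncScanSketch {
    static final int SYNC_SIZE = 16;

    // Avro encodes a long as a zigzag value in little-endian base-128 varint form.
    static void writeLong(ByteArrayOutputStream out, long n) {
        long z = (n << 1) ^ (n >> 63);
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80));
            z >>>= 7;
        }
        out.write((int) z);
    }

    // Inverse of writeLong: read a varint, then undo the zigzag mapping.
    static long readLong(ByteBuffer in) {
        long z = 0;
        int shift = 0;
        int b;
        do {
            b = in.get() & 0xFF;
            z |= (long) (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return (z >>> 1) ^ -(z & 1);
    }

    // Avro strings and bytes are a long length followed by the raw bytes.
    static void writeBytes(ByteArrayOutputStream out, byte[] value) {
        writeLong(out, value.length);
        out.write(value, 0, value.length);
    }

    // ASSUMED header layout: magic, then a metadata map whose first entry is
    // "avro.sync" and whose second entry is "avro.schema", then the real marker.
    static byte[] buildHeader(byte[] sync, String schemaJson) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(new byte[] {'O', 'b', 'j', 1}, 0, 4);   // container file magic
        writeLong(out, 2);                                // two metadata entries
        writeBytes(out, "avro.sync".getBytes(StandardCharsets.UTF_8));
        writeBytes(out, sync);                            // the stray copy of the marker
        writeBytes(out, "avro.schema".getBytes(StandardCharsets.UTF_8));
        writeBytes(out, schemaJson.getBytes(StandardCharsets.UTF_8));
        writeLong(out, 0);                                // end of metadata map
        out.write(sync, 0, sync.length);                  // the real header sync marker
        return out.toByteArray();
    }

    // Returns the offset just past the first occurrence of sync at or after start, or -1.
    static int scanForSync(byte[] data, int start, byte[] sync) {
        for (int i = start; i + sync.length <= data.length; i++) {
            if (Arrays.equals(Arrays.copyOfRange(data, i, i + sync.length), sync)) {
                return i + sync.length;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] sync = new byte[SYNC_SIZE];
        Arrays.fill(sync, (byte) 0xAB);                   // arbitrary marker for the demo
        byte[] header = buildHeader(sync, "\"string\"");

        int pos = scanForSync(header, 0, sync);           // mimics syncing to split start 0
        ByteBuffer in = ByteBuffer.wrap(header);
        in.position(pos);
        long blockCount = readLong(in);                   // really the length of "avro.schema"
        long blockSize = readLong(in);                    // really the letter 'a'

        System.out.println("reader positioned at " + pos);
        System.out.println("bogus block count = " + blockCount);
        System.out.println("bogus block size  = " + blockSize);
    }
}
```

Under these assumptions the scan stops at offset 32, in the middle of the metadata map. The next byte is the length prefix of "avro.schema" (11), which is taken as the block's record count, and the byte after that is the letter 'a' (0x61), which zigzag-decodes to -49 and is taken as the block size. A reader that starts from the real marker at the end of the header would instead land on the first data block.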


