Date: Thu, 22 Dec 2011 22:33:30 +0000 (UTC)
From: "Doug Cutting (Updated) (JIRA)"
To: dev@avro.apache.org
Reply-To: dev@avro.apache.org
Message-ID: <1256957569.40769.1324593210673.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <562633454.38021.1324526971642.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Updated] (AVRO-986) Avro files generated from avro-c dont work with the Java mapred implementation.
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394

     [ https://issues.apache.org/jira/browse/AVRO-986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated AVRO-986:
------------------------------

    Attachment: AVRO-986-java.patch

Perhaps we should fix the Java code too. Here's a patch that should do the trick.

To test this we should probably add a file to share/test/data that has "avro.sync" in its metadata, and test that reads following a DataFileReader#sync(0) on that file work correctly.

> Avro files generated from avro-c dont work with the Java mapred implementation.
> -------------------------------------------------------------------------------
>
>                 Key: AVRO-986
>                 URL: https://issues.apache.org/jira/browse/AVRO-986
>             Project: Avro
>          Issue Type: Bug
>          Components: c, java
>         Environment: avro-c 1.6.2-SNAPSHOT
>                      avro-java 1.6.2-SNAPSHOT
>                      hadoop 0.20.2
>            Reporter: Michael Cooper
>            Priority: Critical
>              Labels: c, hadoop, java, mapreduce
>         Attachments: 0001-Remove-sync-marker-from-metadata-in-header.patch, AVRO-986-java.patch
>
>
> When a file generated by the Avro-C implementation is fed into Hadoop, it fails with "Block size invalid or too large for this implementation: -49".
> This is caused by the sync marker, namely the one that Avro-C puts into the header...
> The org.apache.avro.mapred.AvroRecordReader uses a FileSplit object to work out where it should read from, but this class is not particularly smart: it just divides the file into equal-size chunks, the first starting at position 0.
> So org.apache.avro.mapred.AvroRecordReader gets 0 as the start of its chunk, and calls
> {code:title=AvroRecordReader.java}reader.sync(split.getStart()); // sync to start{code}
> Then org.apache.avro.file.DataFileReader::seek() goes to 0, then searches for a sync marker....
> It encounters one at position 32, the one in the header metadata map, "avro.sync".
> No other implementations add the sync marker in the metadata map, and none read it from there, not even the C version.
> I suggest we remove this from the header as the simplest solution.
> Another solution would be to create an AvroFileSplit class in mapred that knows where the blocks are, and provides the correct locations in the first place.
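
To make the failure described in the quoted report concrete, here is a minimal, self-contained sketch of the scan. It is not Avro library code: the class `SyncScanSketch` and all of its methods are hypothetical, the header omits "avro.schema" for brevity, and it assumes "avro.sync" is the first metadata entry (which is what puts the end of the stray marker at position 32). The scan is a simplified model of "seek to 0, then search for the sync marker", not the actual DataFileReader implementation.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical, simplified model of an Avro container-file header; not Avro library code.
final class SyncScanSketch {
    static final byte[] MAGIC = {'O', 'b', 'j', 1};

    // Encode a small non-negative long as a one-byte zigzag varint (valid for 0..63 only).
    static void writeSmallLong(ByteArrayOutputStream out, int n) {
        if (n < 0 || n > 63) throw new IllegalArgumentException("sketch handles 0..63 only");
        out.write(n << 1);
    }

    // Decode a one-byte zigzag varint (assumes the continuation bit is clear).
    static long readSmallLong(byte b) {
        int n = b & 0x7F;
        return (n >>> 1) ^ -(n & 1);
    }

    // Write a length-prefixed byte string, as used for map keys and values.
    static void writeBytes(ByteArrayOutputStream out, byte[] b) {
        writeSmallLong(out, b.length);
        out.write(b, 0, b.length);
    }

    // Build magic + metadata map + header sync marker.
    // When syncInMetadata is true, the marker is also stored under "avro.sync",
    // which is the behaviour the report attributes to avro-c.
    static byte[] buildHeader(byte[] sync, boolean syncInMetadata) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(MAGIC, 0, MAGIC.length);
        writeSmallLong(out, syncInMetadata ? 2 : 1);   // number of map entries
        if (syncInMetadata) {
            writeBytes(out, "avro.sync".getBytes(StandardCharsets.US_ASCII));
            writeBytes(out, sync);
        }
        writeBytes(out, "avro.codec".getBytes(StandardCharsets.US_ASCII));
        writeBytes(out, "null".getBytes(StandardCharsets.US_ASCII));
        writeSmallLong(out, 0);                        // end of map
        out.write(sync, 0, sync.length);               // the real header sync marker
        return out.toByteArray();
    }

    // Return the position just past the first occurrence of sync at or after start, or -1.
    static int positionAfterFirstSync(byte[] file, byte[] sync, int start) {
        for (int i = start; i + sync.length <= file.length; i++) {
            if (Arrays.equals(Arrays.copyOfRange(file, i, i + sync.length), sync)) {
                return i + sync.length;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] sync = new byte[16];
        for (int i = 0; i < sync.length; i++) sync[i] = (byte) (0xA0 + i);

        byte[] withMeta = buildHeader(sync, true);
        int falseStart = positionAfterFirstSync(withMeta, sync, 0);
        System.out.println("scan from 0 stops at " + falseStart
            + "; header actually ends at " + withMeta.length);
        // The bytes after the stray marker are metadata, but get read as a block header.
        System.out.println("misread block count = " + readSmallLong(withMeta[falseStart])
            + ", misread block size = " + readSmallLong(withMeta[falseStart + 1]));
    }
}
```

Under these assumptions the scan stops just past the copy of the marker stored in the metadata map, and the next two bytes (the length of the following key and its first character, 'a') are decoded as a block count and a block size. The letter 'a' (0x61) zigzag-decodes to -49, which is consistent with the error message in the report, although the report does not confirm the exact metadata order in the failing file. With `syncInMetadata` set to false the first marker found is the header's own, so the scan ends where the first block begins.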