avro-user mailing list archives

From Scott Carey <sc...@richrelevance.com>
Subject Re: Invalid sync error when reading Avro file (Amazon EMR Hadoop)
Date Thu, 26 May 2011 16:14:43 GMT
Hi Matt,

You may know this already, but for general reference:

There is an Avro API for concatenating Avro files (and potentially compressing them) in Java.
Additionally, there is a command line tool for concatenating in avro-tools.jar.

-Scott

On 5/26/11 7:46 AM, "Matt Pouttu-Clarke" <Matt.Pouttu-Clarke@icrossing.com> wrote:

Hi Scott,

Thanks for the response.  It turns out that in part of our code we were concatenating smaller
files into larger files using i/o streams.  This used to work fine when the files were JSON
text files.  However, we learned the hard way that with Avro you cannot concatenate files
in the traditional sense.  Unless you parse the inputs and merge them using the Avro APIs,
you get the ‘Invalid sync’ error when attempting to read the concatenated file.  In retrospect,
this obviously has to do with the JSON schema at the beginning of each file not being valid
in the middle of the concatenated file.
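
To make the failure mode above concrete, here is a minimal sketch in Java that uses only the standard library. It models a toy container format, not Avro's real layout: a one-byte magic value and a 16-byte sync marker form the header, and each block is a one-byte length, a payload, and the sync marker again. Every name in it (ToySyncFile, write, read, rawConcat, merge) is hypothetical. The one property it shares with Avro data files is that each file header carries its own sync marker, and the reader checks that marker after every block.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy container format for illustration only; this is NOT Avro's on-disk layout. */
final class ToySyncFile {
    static final int MAGIC = 0x01;
    static final int SYNC_SIZE = 16;

    /** Writes a header (magic + sync), then one block per record (length + payload + sync). */
    public static byte[] write(byte[] sync, List<String> records) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(MAGIC);
        out.write(sync, 0, SYNC_SIZE);
        for (String record : records) {
            byte[] payload = record.getBytes(StandardCharsets.UTF_8);
            if (payload.length > 255) {
                throw new IllegalArgumentException("record too long for the toy format");
            }
            out.write(payload.length);
            out.write(payload, 0, payload.length);
            out.write(sync, 0, SYNC_SIZE);
        }
        return out.toByteArray();
    }

    /** Reads all records, checking the header's sync marker after every block. */
    public static List<String> read(byte[] file) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(file));
        if (in.read() != MAGIC) {
            throw new IOException("Not a toy file");
        }
        byte[] sync = new byte[SYNC_SIZE];
        in.readFully(sync);
        byte[] syncBuffer = new byte[SYNC_SIZE];
        List<String> records = new ArrayList<>();
        int length;
        while ((length = in.read()) != -1) {
            byte[] payload = new byte[length];
            in.readFully(payload);
            in.readFully(syncBuffer);
            if (!Arrays.equals(syncBuffer, sync)) {
                throw new IOException("Invalid sync!");
            }
            records.add(new String(payload, StandardCharsets.UTF_8));
        }
        return records;
    }

    /** Byte-level concatenation, the equivalent of piping both files through i/o streams. */
    public static byte[] rawConcat(byte[] a, byte[] b) {
        byte[] out = Arrays.copyOf(a, a.length + b.length);
        System.arraycopy(b, 0, out, a.length, b.length);
        return out;
    }

    /** Format-aware merge: parse both inputs, then rewrite everything under one header. */
    public static byte[] merge(byte[] sync, byte[] a, byte[] b) throws IOException {
        List<String> all = new ArrayList<>(read(a));
        all.addAll(read(b));
        return write(sync, all);
    }

    public static void main(String[] args) throws IOException {
        byte[] syncA = new byte[SYNC_SIZE];
        byte[] syncB = new byte[SYNC_SIZE];
        Arrays.fill(syncA, (byte) 0xAA);
        Arrays.fill(syncB, (byte) 0xBB);
        byte[] a = write(syncA, Arrays.asList("a1", "a2"));
        byte[] b = write(syncB, Arrays.asList("b1"));

        // The second file's header is misread as a block, so the sync check fails.
        try {
            read(rawConcat(a, b));
        } catch (IOException e) {
            System.out.println("raw concat failed: " + e.getMessage());
        }
        System.out.println("merged: " + read(merge(syncA, a, b)));
    }
}
```

In the raw concatenation, the reader gets through the first file's blocks and then interprets the second file's header bytes as the next block, so the 16 bytes it compares against the first file's sync marker do not match. The merge path avoids this by emitting a single header. For real Avro files, that job belongs to the Avro concatenation API or the avro-tools command that Scott mentions; this sketch only illustrates why a byte-level cat cannot work.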

The reason why we didn’t see this on our Apache Hadoop dev cluster was the data size was
smaller and the concatenation was 1-to-1.

Maybe a better error message would have led us to this conclusion sooner?  Other than that
it’s not really Avro’s problem.

-Matt

On 5/25/11 4:40 PM, "Scott Carey" <scott@richrelevance.com> wrote:

The svn change you note is from AVRO-160.  Avro's file format changed between Avro 1.2 and
1.3.  Recent versions (Avro 1.5.x and perhaps 1.4.1) have a file reader class for Avro 1.2 that
is separate in case old format files need to be read.

We weren't aware of anyone using the 1.2 format at the time we changed (see AVRO-160).

I'm not sure your error below is due to that change, however.  Does the error below occur before
any records are retrieved, or part-way through after some have been accessed?



On 5/25/11 2:34 PM, "Matt Pouttu-Clarke" <Matt.Pouttu-Clarke@icrossing.com> wrote:

Getting this error when reading an Avro file on Amazon EMR Hadoop.  Does not occur on any
recent Apache Hadoop build.

Exception org.apache.avro.AvroRuntimeException: java.io.IOException: Invalid sync!
org.apache.avro.AvroRuntimeException: java.io.IOException: Invalid sync!
    at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:176)
    at Abc.readAvroFile(Abc.java:28)
    at Abc.main(Abc.java:65)
Caused by: java.io.IOException: Invalid sync!
    at org.apache.avro.file.DataFileStream.nextRawBlock(DataFileStream.java:258)
    at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:164)
    ... 2 more

The source code that throws the Invalid sync! exception suggests a low-level I/O problem:
{code}
244      DataBlock nextRawBlock(DataBlock reuse) throws IOException {
245        if (!hasNextBlock()) {
246          throw new NoSuchElementException();
247        }
248        if (reuse == null || reuse.data.length < (int) blockSize) {
249          reuse = new DataBlock(blockRemaining, (int) blockSize);
250        } else {
251          reuse.numEntries = blockRemaining;
252          reuse.blockSize = (int)blockSize;
253        }
254        // throws if it can't read the size requested
255        vin.readFixed(reuse.data, 0, reuse.blockSize);
256        vin.readFixed(syncBuffer);
257        if (!Arrays.equals(syncBuffer, sync))
258          throw new IOException("Invalid sync!");
259        availableBlock = false;
260        return reuse;
261      }
{code}

Looks like this commit from Doug Cutting removed those error messages:
http://www.mail-archive.com/avro-commits@hadoop.apache.org/msg00218.html

Anyone have any clue as to what could cause these errors?

Thanks,
Matt


iCrossing Privileged and Confidential Information
This email message is for the sole use of the intended recipient(s) and may contain confidential
and privileged information of iCrossing. Any unauthorized review, use, disclosure or distribution
is prohibited. If you are not the intended recipient, please contact the sender by reply email
and destroy all copies of the original message.


