avro-user mailing list archives

From Scott Carey <scottca...@apache.org>
Subject Re: Avro MR job problem with empty strings
Date Sat, 03 Sep 2011 00:38:07 GMT
Some ideas:

A string is encoded as a long length, followed by that number of bytes of
UTF-8.
An empty string is therefore encoded as the long 0, which is a single
byte, 0x00.
From the trace, it appears the decoder is trying to skip a string or a
long, but it has already reached the end of the byte[].
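
To make the encoding concrete, here is a minimal sketch in plain Java. It
is hand-rolled and does not use the Avro library; the class and method
names (AvroStringEncodingSketch, writeLong, encodeString) are made up for
illustration. It writes the length as a zig-zag variable-length long and
then the UTF-8 bytes, which is how I understand the binary encoding to
work.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative only: a hand-written encoder, not Avro's BinaryEncoder.
class AvroStringEncodingSketch {

    // Zig-zag maps signed longs to unsigned ones so small magnitudes stay
    // short, then emits 7 bits per byte, low-order group first. The high
    // bit of each byte says whether another byte follows.
    static void writeLong(long n, ByteArrayOutputStream out) {
        long z = (n << 1) ^ (n >> 63);
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80));
            z >>>= 7;
        }
        out.write((int) z);
    }

    // A string is its UTF-8 byte count (as a long) followed by the bytes.
    static byte[] encodeString(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeLong(utf8.length, out);
        out.write(utf8, 0, utf8.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(encodeString("")));    // [0]
        System.out.println(Arrays.toString(encodeString("foo"))); // [6, 102, 111, 111]
    }
}
```

The point is that an empty string still occupies one byte on the wire. If
that byte never gets written, the reader has nothing to consume.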

So it is expecting a long or a string to skip, and there is nothing
there.  Perhaps the empty string was not encoded as an empty string, but
left out entirely.  Perhaps a long count or some other number is missing.
(What is the schema being compared?)

WordCount is often key = word, val = count, and so it would need to read
the string word, and skip the long count.  If either of these is left out
and not written, I would expect the sort of error below.
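
As a rough illustration of that failure mode, the sketch below reads a
(string key, long count) pair from a byte[] and reports whether it runs
off the end. It is plain Java with a hand-written decoder, not Avro's
BinaryDecoder or BinaryData, and all names in it (SkipSketch, skipString,
canReadPair) are hypothetical. I am also assuming a key/count layout here,
which may not match your actual schema.

```java
import java.io.EOFException;

// Illustrative only: a tiny decoder over a byte[], not Avro's BinaryDecoder.
class SkipSketch {
    private final byte[] buf;
    private int pos;

    SkipSketch(byte[] buf) { this.buf = buf; }

    // Reads a zig-zag variable-length long; throws if the buffer runs out.
    long readLong() throws EOFException {
        long z = 0;
        int shift = 0;
        while (true) {
            if (pos >= buf.length) throw new EOFException();
            int b = buf[pos++] & 0xFF;
            z |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) break;
            shift += 7;
        }
        return (z >>> 1) ^ -(z & 1);
    }

    // Skips a string: read the length, then step over that many bytes.
    void skipString() throws EOFException {
        long len = readLong();
        if (len < 0 || pos + len > buf.length) throw new EOFException();
        pos += (int) len;
    }

    // True if a string followed by a long can be consumed without hitting
    // the end of the buffer.
    static boolean canReadPair(byte[] bytes) {
        SkipSketch d = new SkipSketch(bytes);
        try {
            d.skipString();
            d.readLong();
            return true;
        } catch (EOFException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Empty key (0x00) followed by count 1 (0x02): reads cleanly.
        System.out.println(canReadPair(new byte[] {0x00, 0x02})); // true
        // Empty key written but count missing: EOF while reading the long.
        System.out.println(canReadPair(new byte[] {0x00}));       // false
        // Key left out, only count 1 written: 0x02 is taken as a string
        // of length 1 with no bytes behind it.
        System.out.println(canReadPair(new byte[] {0x02}));       // false
    }
}
```

In both failing cases the reader is one field short, which is the kind of
thing that would surface as an EOFException inside readLong().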

I hope that helps,

-Scott

On 9/1/11 5:42 AM, "Friso van Vollenhoven" <fvanvollenhoven@xebia.com>
wrote:

>Hi All,
>
>I am working on a modified version of the Avro MapReduce support to make
>it play nice with the new Hadoop API (0.20.2). Most of the code is
>borrowed from the Avro mapred package, but I decided not to fully
>abstract away the Mapper and Reducer classes (like Avro does now using
>HadoopMapper and HadoopReducer classes). All else is much the same as the
>mapred implementation.
>
>When testing, I ran into an issue when emitting empty strings (empty
>Utf8) from the mapper as key. I get the following:
>org.apache.avro.AvroRuntimeException: java.io.EOFException
>	at org.apache.avro.io.BinaryData.compare(BinaryData.java:74)
>	at org.apache.avro.io.BinaryData.compare(BinaryData.java:60)
>	at org.apache.avro.mapreduce.AvroKeyComparator.compare(AvroKeyComparator.java:45)        <== this is my own code
>	at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:120)
>	at org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:92)
>	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:175)
>	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:572)
>	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:414)
>	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:256)
>Caused by: java.io.EOFException
>	at org.apache.avro.io.BinaryDecoder.readLong(BinaryDecoder.java:182)
>	at org.apache.avro.generic.GenericDatumReader.skip(GenericDatumReader.java:389)
>	at org.apache.avro.io.BinaryData.compare(BinaryData.java:86)
>	at org.apache.avro.io.BinaryData.compare(BinaryData.java:72)
>	... 8 more
>
>
>The root cause stack trace is as follows (taken from debugger, breakpoint
>on the throw new EOFException(); line):
>Thread [Thread-11] (Suspended (breakpoint at line 182 in BinaryDecoder))
>	BinaryDecoder.readLong() line: 182
>	GenericDatumReader<D>.skip(Schema, Decoder) line: 389
>	BinaryData.compare(BinaryData$Decoders, Schema) line: 86
>	BinaryData.compare(byte[], int, int, byte[], int, int, Schema) line: 72
>	BinaryData.compare(byte[], int, byte[], int, Schema) line: 60
>	AvroKeyComparator<T>.compare(byte[], int, int, byte[], int, int) line: 45
>	Reducer$Context(ReduceContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT>).nextKeyValue() line: 120
>	Reducer$Context(ReduceContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT>).nextKey() line: 92
>	AvroMapReduceTest$WordCountingAvroReducer(Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>).run(Reducer<KEYIN,VALUEIN,KEYOUT,Contex>) line: 175
>	ReduceTask.runNewReducer(JobConf, TaskUmbilicalProtocol, TaskReporter, RawKeyValueIterator, RawComparator<INKEY>, Class<INKEY>, Class<INVALUE>) line: 572
>	ReduceTask.run(JobConf, TaskUmbilicalProtocol) line: 414
>	LocalJobRunner$Job.run() line: 256
>
>I went through the decoding code to see where this comes from, but I
>can't immediately spot where it goes wrong. I am guessing the actual
>problem is earlier during execution where it possibly increases pos too
>often.
>
>Has anyone experienced this? I can live without emitting empty keys from
>MR jobs, but I ran into this implementing a word count job on a text file
>with empty lines (counting those could be a valid use case). I am using
>Avro 1.5.2.
>
>Thanks for any clues.
>
>
>Cheers,
>Friso
>


