avro-user mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: Reduce Side Combiner
Date Tue, 25 Oct 2011 16:44:24 GMT
I've filed a Jira and posted a patch:

https://issues.apache.org/jira/browse/AVRO-944

Can you please tell me whether this patch fixes things for you?

Thanks,

Doug

On 10/19/2011 06:20 PM, Elliott Clark wrote:
> When running a map reduce job using avro mapred we're having some issues
> with combiners.
> 
> When running over a small data set, map-side combiners run and report
> that they combined records.
> When running over a larger data set, combiners run and report that they
> combined 1.4 billion records into 6 million.  However, the reduce phase
> fails with:
> 
> 2011-10-19 21:37:34,777 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201109220009_0156_r_000000_0 Merge of the inmemory files threw an exception: java.io.IOException: Intermediate merge failed
> 	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2714)
> 	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.run(ReduceTask.java:2639)
> Caused by: org.apache.avro.AvroRuntimeException: No field named rowKey in: null
> 	at org.apache.avro.reflect.ReflectData.findField(ReflectData.java:194)
> 	at org.apache.avro.reflect.ReflectData.getField(ReflectData.java:179)
> 	at org.apache.avro.reflect.ReflectData.getField(ReflectData.java:96)
> 	at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:102)
> 	at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:65)
> 	at org.apache.avro.reflect.ReflectDatumWriter.write(ReflectDatumWriter.java:102)
> 	at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:57)
> 	at org.apache.avro.mapred.AvroSerialization$AvroWrapperSerializer.serialize(AvroSerialization.java:131)
> 	at org.apache.avro.mapred.AvroSerialization$AvroWrapperSerializer.serialize(AvroSerialization.java:114)
> 	at org.apache.hadoop.mapred.IFile$Writer.append(IFile.java:179)
> 	at org.apache.hadoop.mapred.Task$CombineOutputCollector.collect(Task.java:1025)
> 	at org.apache.avro.mapred.HadoopCombiner$PairCollector.collect(HadoopCombiner.java:52)
> 	at org.apache.avro.mapred.HadoopCombiner$PairCollector.collect(HadoopCombiner.java:40)
> 	at com.ngmoco.ngpipes.sourcing.NgBucketingEventCountingCombiner.reduce(NgBucketingEventCountingCombiner.java:63)
> 	at com.ngmoco.ngpipes.sourcing.NgBucketingEventCountingCombiner.reduce(NgBucketingEventCountingCombiner.java:17)
> 	at org.apache.avro.mapred.HadoopReducerBase.reduce(HadoopReducerBase.java:61)
> 	at org.apache.avro.mapred.HadoopReducerBase.reduce(HadoopReducerBase.java:30)
> 	at org.apache.hadoop.mapred.Task$OldCombinerRunner.combine(Task.java:1296)
> 	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2701)
> 	... 1 more
> 
> 
> 
> rowKey is only present in our output schema.  From reading the code, it
> looks like the combiner is using the wrong collector.
> 
> Commenting out the combiner makes everything work.  Running over a
> smaller dataset also lets the job run well.  Basically, anything that
> keeps the code from
> https://issues.apache.org/jira/browse/HADOOP-3226 from running lets
> the job succeed.
> 
> Any ideas on how to fix this?  The above patch to Hadoop was
> committed to trunk without any additional tests, so I'm not really sure
> how to reproduce this at a small, non-distributed scale for a unit test.
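
For readers of the archive, the mismatch described in the quoted report can be illustrated without a cluster. The following is only a minimal sketch in plain Java, not Avro or Hadoop code and not the AVRO-944 patch: the class name, the field name `bucket`, and the field lists standing in for schemas are all hypothetical. It assumes a combiner emits records shaped like the map output, and shows that writing such a record against the job output schema (the only one that contains rowKey) fails on the field lookup.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CombinerSchemaSketch {
    // Hypothetical stand-ins for the two schemas in a job: field names only.
    static final List<String> MAP_OUTPUT_FIELDS = List.of("bucket", "count");
    static final List<String> JOB_OUTPUT_FIELDS = List.of("rowKey", "count");

    // Serialize a record by looking up every field the schema names,
    // throwing when the record does not carry one of them.
    static String serialize(Map<String, Object> record, List<String> schemaFields) {
        StringBuilder out = new StringBuilder();
        for (String field : schemaFields) {
            if (!record.containsKey(field)) {
                throw new IllegalStateException("No field named " + field);
            }
            if (out.length() > 0) {
                out.append(',');
            }
            out.append(field).append('=').append(record.get(field));
        }
        return out.toString();
    }

    // A combiner emits intermediate records: same shape as the map output.
    static Map<String, Object> combine(String bucket, long... counts) {
        long total = 0;
        for (long c : counts) {
            total += c;
        }
        Map<String, Object> combined = new LinkedHashMap<>();
        combined.put("bucket", bucket);
        combined.put("count", total);
        return combined;
    }

    public static void main(String[] args) {
        Map<String, Object> combined = combine("2011-10-19", 3, 4, 5);

        // Writing with the intermediate (map output) schema succeeds.
        System.out.println(serialize(combined, MAP_OUTPUT_FIELDS));

        // Writing the same record with the job output schema fails,
        // because rowKey exists only in that schema.
        try {
            serialize(combined, JOB_OUTPUT_FIELDS);
        } catch (IllegalStateException e) {
            System.out.println("wrong schema: " + e.getMessage());
        }
    }
}
```

A test in this spirit needs no distributed setup: it only has to drive a combiner's output through a writer configured with each schema and check which one fails. Whether that matches the actual cause in the Avro code is for the Jira issue to confirm.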
