I had this problem once too. Did you properly override the reduce method and mark it with the @Override annotation?
Does your reduce method use OutputCollector or Context for gathering outputs? If you are using the current API (org.apache.hadoop.mapreduce), it has to be Context.

The thing is: if you do NOT override it, the standard reduce function (the identity) is used, and this of course results in the same number of tuples going out as you read in.
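
To make the pitfall concrete, here is a minimal sketch in plain Java, without Hadoop on the classpath. BaseReducer, BrokenSumReducer and SumReducer are hypothetical stand-ins I made up for illustration, not Hadoop classes. The broken variant takes an Iterator instead of an Iterable, so it only overloads reduce and is never called; I assume this mirrors what happens when a signature from the old API is mixed with the new one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a framework base class whose default reduce is the identity.
class BaseReducer {
    // Default behaviour: emit every input value unchanged.
    void reduce(String key, Iterable<Integer> values, List<String> out) {
        for (Integer v : values) {
            out.add(key + "=" + v);
        }
    }

    // The "framework" always calls the Iterable-based signature above.
    final List<String> run(String key, List<Integer> values) {
        List<String> out = new ArrayList<>();
        reduce(key, values, out);
        return out;
    }
}

// Wrong: Iterator instead of Iterable makes this an overload, not an override.
// Putting @Override on this method would turn the mistake into a compile error.
class BrokenSumReducer extends BaseReducer {
    void reduce(String key, Iterator<Integer> values, List<String> out) {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        out.add(key + "=" + sum);
    }
}

// Right: same signature as the base class, checked by @Override.
class SumReducer extends BaseReducer {
    @Override
    void reduce(String key, Iterable<Integer> values, List<String> out) {
        int sum = 0;
        for (Integer v : values) {
            sum += v;
        }
        out.add(key + "=" + sum);
    }
}

class OverrideDemo {
    public static void main(String[] args) {
        List<Integer> values = Arrays.asList(1, 2, 3);
        // Identity behaviour: three records in, three records out.
        System.out.println(new BrokenSumReducer().run("a", values)); // [a=1, a=2, a=3]
        // Overridden behaviour: three records in, one record out.
        System.out.println(new SumReducer().run("a", values));       // [a=6]
    }
}
```

With the annotation in place the compiler rejects the broken variant right away, so you do not have to discover the problem from the job counters.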

Good luck!

On 25.09.2012 at 11:57, Sigurd Spieckermann <sigurd.spieckermann@gmail.com> wrote:

I think I have tracked down the problem to the point that each split only contains one big key-value pair and a combiner is connected to a map task. Please correct me if I'm wrong, but I assume each map task takes one split and the combiner operates only on the key-value pairs within one split. That's why the combiner has no effect in my case. Is there a way to combine the mapper outputs of multiple splits before they are sent off to the reducer?
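
The assumption above can be illustrated with a small simulation in plain Java, without Hadoop. CombinerScopeSketch and its methods are hypothetical names, and the record and key counts are made up to echo the 100-in/100-out counters. It only models the scope of a combiner (one map task's output at a time), not when or how often Hadoop actually invokes it.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CombinerScopeSketch {
    // Sums the values per key within ONE map task's output.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> sums = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : mapOutput) {
            sums.merge(e.getKey(), e.getValue(), Integer::sum);
        }
        return sums;
    }

    // Runs the combiner separately on each split and counts all records it emits.
    static int combinedRecordCount(List<List<Map.Entry<String, Integer>>> splits) {
        int total = 0;
        for (List<Map.Entry<String, Integer>> split : splits) {
            total += combine(split).size();
        }
        return total;
    }

    // Builds `records` key-value pairs over `distinctKeys` keys, `perSplit` records per split.
    static List<List<Map.Entry<String, Integer>>> makeSplits(int records, int distinctKeys, int perSplit) {
        List<List<Map.Entry<String, Integer>>> splits = new ArrayList<>();
        List<Map.Entry<String, Integer>> current = new ArrayList<>();
        for (int i = 0; i < records; i++) {
            current.add(new AbstractMap.SimpleEntry<>("key" + (i % distinctKeys), 1));
            if (current.size() == perSplit) {
                splits.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        // One record per split: the combiner never sees two values for the same key.
        System.out.println(combinedRecordCount(makeSplits(100, 10, 1)));   // 100
        // All records in one split: the combiner collapses them to one record per key.
        System.out.println(combinedRecordCount(makeSplits(100, 10, 100))); // 10
    }
}
```

In the first case the combiner output equals its input, which is the pattern I see in the job statistics.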

2012/9/25 Sigurd Spieckermann <sigurd.spieckermann@gmail.com>
Maybe one more note: the combiner and the reducer class are the same and in the reduce-phase the values get aggregated correctly. Why is this not happening in the combiner-phase?

2012/9/25 Sigurd Spieckermann <sigurd.spieckermann@gmail.com>
Hi guys,

I'm experiencing strange behavior when I use the Hadoop join package. After running a job, the result statistics show that my combiner has an input of 100 records and an output of 100 records. From the task I'm running and the way it's implemented, I know that each key appears multiple times and the values should be combinable before getting passed to the reducer. I'm running my tests in pseudo-distributed mode with one or two map tasks. Using the debugger, I noticed that each key-value pair is processed by a combiner individually, so no list is actually passed into the combiner that it could aggregate. Can anyone think of a reason for this undesired behavior?