hadoop-user mailing list archives

From Sigurd Spieckermann <sigurd.spieckerm...@gmail.com>
Subject Re: Join-package combiner number of input and output records the same
Date Tue, 25 Sep 2012 16:34:14 GMT
I'm not doing a conventional join; in my case one split/file consists of
only one key-value pair, and I'm not using the default mapper/reducer
implementations. My guess is that the problem is that a combiner is only
applied to the output of a single map task (an instance of the mapper
class). One map task processes one split, and since I only have one
key-value pair per split, there is nothing to combine. What I would need
is either a combiner that runs across multiple map tasks or a way to
treat all splits on a datanode as one, so that there would only be one
map task. Is there a way to do something like that? Reusing the JVM
hasn't worked in my tests.
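
For reference, a rough sketch of the direction I mean: pack all the small
files on a node into one combined split so that a single map task (and
therefore a single combiner run) sees many key-value pairs. The class
names are made up, it uses the newer mapreduce API (there is an older
counterpart in org.apache.hadoop.mapred.lib), it assumes plain
SequenceFile inputs of <Text, Text>, and it would replace rather than
plug into the join package's CompositeInputFormat; so this is only an
illustration, not something I have running:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

// Hypothetical example: packs many small SequenceFiles into one combined
// split so that one map task (and its combiner) processes all of them.
public class CombinedSeqFileInputFormat extends CombineFileInputFormat<Text, Text> {

  @Override
  public RecordReader<Text, Text> createRecordReader(InputSplit split,
      TaskAttemptContext context) throws IOException {
    // CombineFileRecordReader creates one PerFileReader per file in the split.
    return new CombineFileRecordReader<Text, Text>(
        (CombineFileSplit) split, context, PerFileReader.class);
  }

  // Reads the idx-th file of the combined split with a plain SequenceFile.Reader.
  public static class PerFileReader extends RecordReader<Text, Text> {
    private final SequenceFile.Reader reader;
    private final long length;
    private final Text key = new Text();
    private final Text value = new Text();

    public PerFileReader(CombineFileSplit split, TaskAttemptContext context,
        Integer idx) throws IOException {
      Configuration conf = context.getConfiguration();
      Path path = split.getPath(idx);
      length = split.getLength(idx);
      reader = new SequenceFile.Reader(path.getFileSystem(conf), path, conf);
    }

    @Override public void initialize(InputSplit s, TaskAttemptContext c) { }
    @Override public boolean nextKeyValue() throws IOException {
      return reader.next(key, value);
    }
    @Override public Text getCurrentKey() { return key; }
    @Override public Text getCurrentValue() { return value; }
    @Override public float getProgress() throws IOException {
      return length == 0 ? 1.0f : Math.min(1.0f, reader.getPosition() / (float) length);
    }
    @Override public void close() throws IOException { reader.close(); }
  }
}

As far as I understand, JVM reuse (JobConf.setNumTasksToExecutePerJvm,
i.e. mapred.job.reuse.jvm.num.tasks) only recycles the task process;
every map task still sorts, combines and spills its own output
independently, which would explain why it had no effect here.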

On 25.09.2012 15:40, Björn-Elmar Macek wrote:
> Oops, sorry. You are using standard implementations? Then I don't know
> what's happening. Sorry. But the fact that your input size equals your
> output size in a "join" process reminded me too much of my own problems.
> Sorry for any confusion I may have caused.
>
> Best,
> On 25.09.2012 at 15:32, Björn-Elmar Macek <macek@cs.uni-kassel.de> wrote:
>
>> Hi,
>>
>> I had this problem once too. Did you properly override the reduce
>> method with the @Override annotation?
>> Does your reduce method use OutputCollector or Context for gathering
>> outputs? If you are using the current (new) API, it has to be Context.
>>
>> The thing is: if you do NOT override it, the standard reduce function
>> (identity) is used, and this of course results in the same number of
>> output tuples as you read as input.
>>
>> Good luck!
>> Elmar
>>
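
For reference, this is the shape the framework actually picks up in the
new API (the names here are just an example); with @Override a wrong
signature fails at compile time instead of silently falling back to the
identity reduce:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Example only: sums the IntWritable values of each key.
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    context.write(key, new IntWritable(sum));
  }
}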
>> On 25.09.2012 at 11:57, Sigurd Spieckermann <sigurd.spieckermann@gmail.com> wrote:
>>
>>> I think I have tracked down the problem to the point that each split
>>> only contains one big key-value pair and a combiner is connected to a
>>> map task. Please correct me if I'm wrong, but I assume each map task
>>> takes one split and the combiner operates only on the key-value pairs
>>> within one split. That's why the combiner has no effect in my case.
>>> Is there a way to combine the mapper outputs of multiple splits
>>> before they are sent off to the reducer?
>>>
>>> 2012/9/25 Sigurd Spieckermann <sigurd.spieckermann@gmail.com>
>>>
>>>     Maybe one more note: the combiner and the reducer class are the
>>>     same, and in the reduce phase the values get aggregated correctly.
>>>     Why is this not happening in the combine phase?
>>>
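
For completeness, the driver wiring I have in mind looks roughly like
this (MyMapper is a placeholder, SumReducer is the example class above);
the same class is set as combiner and reducer, but the combiner only
ever sees the buffered output of one map task and may run zero or more
times, so it is purely an optimization:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch of a driver that uses the same class as combiner and reducer.
public class JoinDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "join-with-combiner");
    job.setJarByClass(JoinDriver.class);
    job.setMapperClass(MyMapper.class);       // placeholder mapper
    job.setCombinerClass(SumReducer.class);   // same class is allowed...
    job.setReducerClass(SumReducer.class);    // ...as combiner and reducer
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}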
>>>
>>>     2012/9/25 Sigurd Spieckermann <sigurd.spieckermann@gmail.com>
>>>
>>>         Hi guys,
>>>
>>>         I'm experiencing a strange behavior when I use the Hadoop
>>>         join-package. After running a job the result statistics show
>>>         that my combiner has an input of 100 records and an output of
>>>         100 records. From the task I'm running and the way it's
>>>         implemented, I know that each key appears multiple times and
>>>         the values should be combinable before getting passed to the
>>>         reducer. I'm running my tests in pseudo-distributed mode with
>>>         one or two map tasks. From using the debugger, I noticed that
>>>         each key-value pair is processed by a combiner individually
>>>         so there's actually no list passed into the combiner that it
>>>         could aggregate. Can anyone think of a reason that causes
>>>         this undesired behavior?
>>>
>>>         Thanks
>>>         Sigurd
>>>
>>>
>>>
>>
>
