hadoop-user mailing list archives

From Bertrand Dechoux <decho...@gmail.com>
Subject Re: Join-package combiner number of input and output records the same
Date Tue, 25 Sep 2012 17:02:30 GMT
Hi,

Could you provide an approximation of the data volumes you are dealing with?
If I understand correctly, your map tasks produce almost nothing (it
depends on the size of your key/value, I guess).
My questions are 1) is the combiner really useful in your context? 2) is
the reducer really useful in your context?

Back to your problem: when a map task is done, its sorted output can
already be sent to the reducer, so waiting for the node to finish all its
map tasks may not be the best solution. Furthermore, you would have to know
how many tasks the node will run; knowing which one is the last is not
obvious (unless you wait for absolutely all map tasks to finish), and if
the node is lost, all of its computations have to be redone. So a bigger,
node-wide combiner might not be the best solution, at least in the general
case.
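
To illustrate why the combiner counters show the same number of input and
output records, here is a minimal sketch in plain Python. It is only a
simulation under stated assumptions, not Hadoop API code: the function names
(combine, run_map_side) and the sample records are hypothetical, and the
combiner is assumed to be a simple per-key sum that runs once per map task,
with one map task per split.

```python
from collections import defaultdict


def combine(records):
    """Sum the values per key, as a hypothetical sum-combiner would."""
    totals = defaultdict(int)
    for key, value in records:
        totals[key] += value
    return sorted(totals.items())


def run_map_side(splits):
    """Run the combiner separately on each split (one map task per split).

    Returns (combiner input records, combiner output records, shuffled records).
    """
    combiner_in = 0
    combiner_out = 0
    shuffled = []
    for split in splits:
        combined = combine(split)  # the combiner only sees this task's output
        combiner_in += len(split)
        combiner_out += len(combined)
        shuffled.extend(combined)
    return combiner_in, combiner_out, shuffled


# Assumed sample data: the same four records, laid out in two different ways.
one_per_split = [[("a", 1)], [("a", 2)], [("b", 3)], [("a", 4)]]
single_split = [[("a", 1), ("a", 2), ("b", 3), ("a", 4)]]

print(run_map_side(one_per_split)[:2])  # (4, 4): nothing to combine per task
print(run_map_side(single_split)[:2])   # (4, 2): keys meet inside one task

# The reducer sees all records for a key, so it still aggregates correctly.
print(combine(run_map_side(one_per_split)[2]))  # [('a', 7), ('b', 3)]
```

With one record per split, each simulated combiner call receives a single
record, so the record counts cannot shrink; the aggregation only happens once
the records for a key meet on the reduce side.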

A solution might be to skip the combiner and the reducer, write the output
of the map tasks into a datastore and work on that. But it will depend on
your context, of course.

Regards

Bertrand

On Tue, Sep 25, 2012 at 6:34 PM, Sigurd Spieckermann <
sigurd.spieckermann@gmail.com> wrote:

> I'm not doing a conventional join, but in my case one split/file consists
> of only one key-value pair. I'm not using default mapper/reducer
> implementations. I'm guessing the problem is that a combiner is only
> applied to the output of a map task which is an instance of the mapper
> class, but one map task processes one split and since I only have one
> key-value pair per split, there is nothing to combine. What I would need is
> a combiner across multiple map tasks or a way to treat all splits of a
> datanode as one, hence there would only be one map task. Is there a way to
> do something like that? Reusing the JVM hasn't worked in my tests.
>
> On 25.09.2012 at 15:40, Björn-Elmar Macek wrote:
>
>> Oops, sorry. You are using the standard implementations? Then I don't
>> know what's happening. Sorry. But the fact that your input size equals
>> your output size in a "join" process reminded me too much of my own
>> problems. Sorry for any confusion I may have caused.
>>
>> Best,
>> On 25.09.2012 at 15:32, Björn-Elmar Macek <macek@cs.uni-kassel.de> wrote:
>>
>>  Hi,
>>>
>>> I had this problem once too. Did you properly override the reduce
>>> method with the @Override annotation?
>>> Does your reduce method use OutputCollector or Context for gathering
>>> outputs? If you are using the current version, it has to be Context.
>>>
>>> The thing is: if you do NOT override it, the standard reduce function
>>> (identity) is used, and this of course results in the same number of
>>> tuples as you read as input.
>>>
>>> Good luck!
>>> Elmar
>>>
>>> On 25.09.2012 at 11:57, Sigurd Spieckermann
>>> <sigurd.spieckermann@gmail.com> wrote:
>>>
>>>  I think I have tracked down the problem to the point that each split
>>>> only contains one big key-value pair and a combiner is connected to a
>>>> map task. Please correct me if I'm wrong, but I assume each map task
>>>> takes one split and the combiner operates only on the key-value pairs
>>>> within one split. That's why the combiner has no effect in my case.
>>>> Is there a way to combine the mapper outputs of multiple splits
>>>> before they are sent off to the reducer?
>>>>
>>>> 2012/9/25 Sigurd Spieckermann <sigurd.spieckermann@gmail.com>
>>>>
>>>>
>>>>     Maybe one more note: the combiner and the reducer class are the
>>>>     same and in the reduce-phase the values get aggregated correctly.
>>>>     Why is this not happening in the combiner-phase?
>>>>
>>>>
>>>>     2012/9/25 Sigurd Spieckermann <sigurd.spieckermann@gmail.com>
>>>>
>>>>
>>>>         Hi guys,
>>>>
>>>>         I'm experiencing a strange behavior when I use the Hadoop
>>>>         join-package. After running a job the result statistics show
>>>>         that my combiner has an input of 100 records and an output of
>>>>         100 records. From the task I'm running and the way it's
>>>>         implemented, I know that each key appears multiple times and
>>>>         the values should be combinable before getting passed to the
>>>>         reducer. I'm running my tests in pseudo-distributed mode with
>>>>         one or two map tasks. From using the debugger, I noticed that
>>>>         each key-value pair is processed by a combiner individually
>>>>         so there's actually no list passed into the combiner that it
>>>>         could aggregate. Can anyone think of a reason that causes
>>>>         this undesired behavior?
>>>>
>>>>         Thanks
>>>>         Sigurd
>>>>
>>>>
>>>>
>>>>
>>>
>>


-- 
Bertrand Dechoux
