hadoop-common-user mailing list archives

From "Lukavsky, Jan" <Jan.Lukav...@firma.seznam.cz>
Subject Re: Partitioner vs GroupComparator
Date Fri, 23 Aug 2013 19:13:13 GMT
Hi Shahab,

thanks, I just missed the fact that the key gets updated while iterating over the values. Even
after working with Hadoop for three years, there is always something that can surprise you. :-)

Cheers,
 Jan



-------- Original message --------
Subject: Re: Partitioner vs GroupComparator
From: Shahab Yunus <shahab.yunus@gmail.com>
To: "user@hadoop.apache.org" <user@hadoop.apache.org>
CC:


Jan

" is that you need to put the data you want to secondary sort into your key class. "
Yes, but then you can also leave the secondary sort column/data piece out of the value part,
and this way there will be no duplication.

" But, what I just realized is that the original key probably IS accessible, because of the
Writable semantics. As you iterate through the Iterable passed to the reduce call the Key
changes its contents. Am I right? "

Yes.
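
The behaviour confirmed above can be illustrated without Hadoop on the classpath. The following is a minimal plain-Java sketch, not Hadoop's actual code: CompositeKey, Rec, values and reduce are hypothetical names, and the iterator only imitates how the framework refills the same key instance with each record's key while the values are iterated.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class KeyReuseSketch {

    // Hypothetical mutable composite key, standing in for a custom Writable.
    static final class CompositeKey {
        String base;
        int extension;
    }

    // One sorted map-output record; the value part is left empty on purpose.
    static final class Rec {
        final String base;
        final int extension;
        Rec(String base, int extension) { this.base = base; this.extension = extension; }
    }

    // Imitates the values iterator of one reduce group: every next()
    // overwrites the shared key with the key of the record being returned.
    static Iterator<Object> values(final List<Rec> group, final CompositeKey shared) {
        return new Iterator<Object>() {
            private int pos = 0;
            public boolean hasNext() { return pos < group.size(); }
            public Object next() {
                Rec r = group.get(pos++);
                shared.base = r.base;
                shared.extension = r.extension;
                return null; // nothing in the value; the data lives in the key
            }
            public void remove() { throw new UnsupportedOperationException(); }
        };
    }

    // A reduce-like loop that reads the extension from the key, not the value.
    static List<Integer> reduce(CompositeKey key, Iterator<Object> values) {
        List<Integer> seen = new ArrayList<Integer>();
        while (values.hasNext()) {
            values.next();
            seen.add(key.extension);
        }
        return seen;
    }

    // Runs the sketch on one group whose keys share the base "a".
    static List<Integer> demo() {
        List<Rec> group = Arrays.asList(new Rec("a", 1), new Rec("a", 2), new Rec("a", 3));
        CompositeKey key = new CompositeKey();
        return reduce(key, values(group, key));
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [1, 2, 3]
    }
}
```

The point of the sketch is that the reduce loop never touches the value and still sees every extension, so the value slot can stay empty.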

"all howtos on doing secondary sort look. All I have seen duplicate the secondary part of
the key in value."

Check the link below, where a 'null' value is passed because the data has already been captured
as part of the key due to secondary sort requirements.
http://www.javacodegeeks.com/2013/01/mapreduce-algorithms-secondary-sorting.html

Regards,
Shahab




On Fri, Aug 23, 2013 at 1:34 PM, Lukavsky, Jan <Jan.Lukavsky@firma.seznam.cz>
wrote:
Hi Shahab, I'm not sure if I understand right, but the problem is that you need to put the
data you want to secondary sort into your key class. But, what I just realized is that the
original key probably IS accessible, because of the Writable semantics. As you iterate through
the Iterable passed to the reduce call the Key changes its contents. Am I right? This seems
a bit weird but probably is how it works. I just overlooked this, because of the way the API
looks and how all howtos on doing secondary sort look. All I have seen duplicate the secondary
part of the key in value.

Jan



-------- Original message --------
Subject: Re: Partitioner vs GroupComparator
From: Shahab Yunus <shahab.yunus@gmail.com>
To: "user@hadoop.apache.org" <user@hadoop.apache.org>
CC:


@Jan, why send the 'hidden' part of the key as a value at all? Why not pass the value as
null, or carry some other data in the value part? Then on the reducer side there is no
duplication, and you can extract the 'hidden' part of the key yourself (which should be
possible, as you will be encapsulating it in some class/object model...)?

Regards,
Shahab




On Fri, Aug 23, 2013 at 12:22 PM, Jan Lukavský <jan.lukavsky@firma.seznam.cz>
wrote:
Hi all,

speaking of this, has anyone ever measured how much more data needs to be transferred
over the network when using GroupingComparator the way Harsh suggests? What I mean is: when
you use the GroupingComparator, it hides from you the real key that you emitted from the Mapper.
You just see the first key in the reduce group, and any data that was carried in the key needs
to be duplicated in the value in order to be accessible on the reduce end.

Let's say you have a key consisting of two parts (base, extension), you partition by the 'base'
part and use GroupingComparator to group keys with the same base part. Then you have no other
choice than to emit from the Mapper something like this: (key: (base, extension), value: extension),
which means the 'extension' part is duplicated in the data that has to be transferred over
the network. This overhead can be reduced by using compression between the map and reduce sides,
but I believe that in some cases it can be significant.
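
To make the (base, extension) setup concrete, here is a minimal plain-Java sketch of the three decisions involved: partition on base, sort on (base, extension), group on base. It does not use the Hadoop API; Key, partition, SORT, GROUP and groups are hypothetical names, and the partition formula is only assumed to follow the usual hash-modulo pattern.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class SecondarySortSketch {

    // Hypothetical composite key: (base, extension).
    static final class Key {
        final String base;
        final int extension;
        Key(String base, int extension) { this.base = base; this.extension = extension; }
    }

    // Partition on the base part only, so all extensions of one base
    // end up at the same reducer.
    static int partition(Key k, int numPartitions) {
        return (k.base.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Sort order: base first, then extension (the "secondary" sort).
    static final Comparator<Key> SORT = new Comparator<Key>() {
        public int compare(Key a, Key b) {
            int c = a.base.compareTo(b.base);
            return c != 0 ? c : Integer.compare(a.extension, b.extension);
        }
    };

    // Grouping order: base only, so one reduce call covers a whole base.
    static final Comparator<Key> GROUP = new Comparator<Key>() {
        public int compare(Key a, Key b) { return a.base.compareTo(b.base); }
    };

    // Sorts the keys, then starts a new group whenever GROUP reports a change.
    // Each group lists the extensions in the order a reducer would see them.
    static List<List<Integer>> groups(List<Key> keys) {
        List<Key> sorted = new ArrayList<Key>(keys);
        Collections.sort(sorted, SORT);
        List<List<Integer>> out = new ArrayList<List<Integer>>();
        Key prev = null;
        for (Key k : sorted) {
            if (prev == null || GROUP.compare(prev, k) != 0) {
                out.add(new ArrayList<Integer>());
            }
            out.get(out.size() - 1).add(k.extension);
            prev = k;
        }
        return out;
    }

    public static void main(String[] args) {
        List<Key> keys = Arrays.asList(
            new Key("b", 2), new Key("a", 3), new Key("b", 1), new Key("a", 1));
        System.out.println(groups(keys)); // prints [[1, 3], [1, 2]]
    }
}
```

Note that the extension appears only in the key here; whether it also has to be repeated in the value is exactly the question raised in this message.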

It would be nice if the API allowed access to the 'real' key for each value, not only the
first key of the reduce group. The only
