hadoop-common-user mailing list archives

From "Meng Mao" <meng...@gmail.com>
Subject correct pattern for using setOutputValueGroupingComparator?
Date Mon, 05 Jan 2009 22:07:45 GMT
I'm trying to use map reduce to merge two classes of files, each class
using the same keys for grouping. An example:
class 1 input file:
id_1 A metadatum
id_2 A metadatum
id_1 A metadatum

class 2 input file:
id_1 B some numbers
id_1 B some numbers
id_2 B some numbers

I map using the first token, an id string, as the key. Ideally, the
intermediate input to the reducer class would be this (for the key id_1):
id_1 A metadatum
id_1 A metadatum
id_1 B some numbers
id_1 B some numbers

But because the order of the values within a key is not guaranteed, the
reducer may instead see:
id_1 B some numbers
id_1 A metadatum
id_1 B some numbers
id_1 A metadatum


I was wondering if I could use setOutputValueGroupingComparator to force
records of the first class to sort to the top. I'm having a hard time
interpreting the documentation though:
If equivalence rules for grouping the intermediate keys are required to be
different from those for grouping keys before reduction, then one may
specify a Comparator via JobConf.setOutputValueGroupingComparator(Class)
<http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/JobConf.html#setOutputValueGroupingComparator%28java.lang.Class%29>.
Since JobConf.setOutputKeyComparatorClass(Class)
<http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/JobConf.html#setOutputKeyComparatorClass%28java.lang.Class%29>
can be used to control how intermediate keys are grouped, these can be used
in conjunction to simulate *secondary sort on values*.

My interpretation is as follows:
----------
class 1 input file:
id_1 A metadatum
id_1 A metadatum

class 2 input file:
id_1 B some numbers
id_2 B some numbers

Map with key = first column + delimiter + second column. Supply a comparator
via setOutputKeyComparatorClass that compares only the first half of the
key. Supply a comparator via setOutputValueGroupingComparator that compares
only the second half of the key. Thus, all keys like id_1* go to the same
group, and the records within that group are sorted with the As first, then
the Bs (or the reverse if needed).
----------
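
For comparison, here is a minimal sketch in plain Java, with no Hadoop
dependency, of how a sort comparator and a grouping comparator on a composite
key are commonly combined to get a secondary sort. Every name in it
(SecondarySortSketch, CompositeKey, Pair, SORT, GROUP, shuffle) is a
hypothetical stand-in and not part of the Hadoop API; the shuffle method only
imitates what the framework does between map and reduce. Note that the roles
are assigned the other way round from the interpretation above: the comparator
set via setOutputKeyComparatorClass orders on the whole composite key, and the
one set via setOutputValueGroupingComparator compares only the id.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SecondarySortSketch {

    /** Hypothetical composite key: the id plus the record class ("A" or "B"). */
    static final class CompositeKey {
        final String id;
        final String recordClass;
        CompositeKey(String id, String recordClass) {
            this.id = id;
            this.recordClass = recordClass;
        }
    }

    /** One mapped record: the composite key plus the value handed to the reducer. */
    static final class Pair {
        final CompositeKey key;
        final String value;
        Pair(CompositeKey key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    // Stand-in for the sort comparator: order by id, then by record class.
    static final Comparator<CompositeKey> SORT = (a, b) -> {
        int byId = a.id.compareTo(b.id);
        return byId != 0 ? byId : a.recordClass.compareTo(b.recordClass);
    };

    // Stand-in for the grouping comparator: keys with the same id are one group.
    static final Comparator<CompositeKey> GROUP = (a, b) -> a.id.compareTo(b.id);

    // Assumes each line has at least three space-separated tokens:
    // id, record class, payload.
    static Pair map(String line) {
        String[] t = line.split(" ", 3);
        return new Pair(new CompositeKey(t[0], t[1]), t[1] + " " + t[2]);
    }

    // Imitates the shuffle: sort all pairs by key, then start a new group
    // whenever the grouping comparator reports a different key.
    static Map<String, List<String>> shuffle(List<String> lines) {
        List<Pair> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.add(map(line));
        }
        pairs.sort((a, b) -> SORT.compare(a.key, b.key));

        Map<String, List<String>> groups = new LinkedHashMap<>();
        CompositeKey current = null;
        List<String> values = null;
        for (Pair p : pairs) {
            if (current == null || GROUP.compare(current, p.key) != 0) {
                current = p.key;
                values = new ArrayList<>();
                groups.put(current.id, values);
            }
            values.add(p.value);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
            "id_1 B some numbers",
            "id_1 A metadatum",
            "id_2 B some numbers",
            "id_1 B some numbers",
            "id_2 A metadatum",
            "id_1 A metadatum");
        // Print each group in the order a reducer would iterate over it.
        for (Map.Entry<String, List<String>> e : shuffle(input).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

In this sketch, the values for each id reach the "reducer" with the A records
ahead of the B records, because the ordering is carried entirely by the
composite key. The comparators never look at the values themselves.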

Am I vastly overthinking how setOutputValueGroupingComparator works? I can't
tell from the docs if it is possible to peek at the values associated with
the pair of keys in each comparison. If it is, I probably wouldn't have to
use a different key as done in my interpretation.
