hadoop-common-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Saptarshi Guha <saptarshi.g...@gmail.com>
Subject MapFileoutput Format: keys out of order when emitting in reduce (Hadoop 0.20)
Date Wed, 23 Dec 2009 20:46:05 GMT

I re-wrote MapFileOutputFormat for use with Hadoop 0.20.1 and have a question.
Suppose my Map sends key-value pairs to the reducers.
In my reducer, for a given key value, i emit key1,value1, key2,value2, ... ,

e.g the key (sent to reduce) is e780f987932c84d41e4f14d7607fcb69c6889
(stored as bytes writable
variation) and value is several lines

In the reduce, i emit

key=(e780f987932c84d41e4f14d7607fcb69c6889, 1), value= subset of values
key=(e780f987932c84d41e4f14d7607fcb69c6889, 2), value= subset of values and so

(The key, values stored in a binary form, the comparator is a binary

So the reduce will be emitting keys in a not necessarily sorted order and
MapOutputFormat throws the following exception:

Reduce:java.io.IOException: key out of order:
 "e780f987932c84d41e4f14d7607fcb69c6889" "1"
 after "e72e96c506c4e5cefbc2889e124228f67d121" "10"
at org.apache.hadoop.io.MapFile$Writer.checkKey(MapFile.java:206)
        at org.apache.hadoop.io.MapFile$Writer.append(MapFile.java:192)
        at org.godhuli.rhipe.RHMapFileOutputFormat$1.write(RHMapFileOutputFormat.java:79)
(out of order using binary comparator)

I know the reduce receives keys in sorted order, but the keys it emits may not
be, so I'm not totally surprised.

Q1: Is this expected with MapFileOutputFormat?
Q2: Is the work around to emit as SequenceFileOutputFormat, then run an identity
map (with a reduce) and output as MapFileOutputFormat? If so, doesn't this force
the user to use double the space(at least)?

Much thanks

View raw message