hadoop-common-user mailing list archives

From "Dongwon Kim" <eastcirc...@postech.ac.kr>
Subject About MapTask.java
Date Thu, 24 Feb 2011 12:56:08 GMT


I want to know how "MapTask.java" is implemented, especially the
"MapOutputBuffer" class defined in "MapTask.java".

I've been trying to read "MapTask.java" after reading some references such
as "Hadoop: The Definitive Guide" and
"http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html", but
it's quite tough to read the code directly without detailed comments.


As far as I know, each intermediate (key, value) pair generated by the
user-defined map function is written by the "MapOutputBuffer" class
defined in "MapTask.java" through a call to MapOutputBuffer.collect().

However, I can't understand what each variable defined in "MapOutputBuffer"
is used for.

What I've understood is as follows (* please correct any misunderstanding): 

- The byte buffer "kvbuffer" is where each actual (partition, key, value)
triple is written.

- An integer array "kvindices" is called the "accounting buffer"; every three
elements of it store the indices of the corresponding triple in "kvbuffer".

- Another integer array "kvoffsets" contains indices of triples in
"kvindices".

- "kvstart", "kvend", "kvindex" are used to point into "kvoffsets".

- "bufstart", "bufend", "bufvoid", "bufindex", "bufmark" are used to point
into "kvbuffer".
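
To make the layout above concrete, here is a toy sketch of how I picture one
record being stored. It is my own simplified code, not the code in
"MapTask.java": the class name ToyMapOutputBuffer, the fixed array sizes, and
the keyAt() helper are made up for illustration, and it ignores wrap-around
and spilling. Following the comment on "kvindices", it assumes the partition
number is kept in the accounting buffer next to the key and value offsets.

```java
import java.nio.charset.StandardCharsets;

// Toy model of the three arrays; NOT the real MapOutputBuffer.
class ToyMapOutputBuffer {
    // Positions inside one 3-element accounting entry (names are assumptions).
    static final int PARTITION = 0, KEYSTART = 1, VALSTART = 2, ACCTSIZE = 3;

    final byte[] kvbuffer = new byte[1024];          // serialized keys and values
    final int[] kvindices = new int[16 * ACCTSIZE];  // accounting buffer
    final int[] kvoffsets = new int[16];             // one entry per record
    int kvindex = 0;   // next free slot in kvoffsets
    int bufindex = 0;  // next free byte in kvbuffer

    // Append one record; no capacity checks, wrap-around, or spilling.
    void collect(String key, String value, int partition) {
        byte[] k = key.getBytes(StandardCharsets.UTF_8);
        byte[] v = value.getBytes(StandardCharsets.UTF_8);
        int keystart = bufindex;
        System.arraycopy(k, 0, kvbuffer, bufindex, k.length);
        bufindex += k.length;
        int valstart = bufindex;
        System.arraycopy(v, 0, kvbuffer, bufindex, v.length);
        bufindex += v.length;

        int ind = kvindex * ACCTSIZE;        // start of this record's accounting entry
        kvoffsets[kvindex] = ind;            // kvoffsets -> kvindices
        kvindices[ind + PARTITION] = partition;
        kvindices[ind + KEYSTART] = keystart; // kvindices -> kvbuffer
        kvindices[ind + VALSTART] = valstart;
        kvindex++;
    }

    // Follow kvoffsets -> kvindices -> kvbuffer to read back the i-th key.
    String keyAt(int i) {
        int ind = kvoffsets[i];
        int start = kvindices[ind + KEYSTART];
        int end = kvindices[ind + VALSTART];  // the key ends where the value starts
        return new String(kvbuffer, start, end - start, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ToyMapOutputBuffer b = new ToyMapOutputBuffer();
        b.collect("apple", "1", 0);
        b.collect("kiwi", "22", 1);
        System.out.println(b.keyAt(1) + " is in partition "
                + b.kvindices[b.kvoffsets[1] + PARTITION]);
    }
}
```

If this picture is right, sorting only needs to swap entries in "kvoffsets",
and the bytes in "kvbuffer" never move.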


What I can't understand are the comments beside the variable definitions.

===================== definitions of some variables

    private volatile int kvstart = 0;  // marks beginning of *spill*
    private volatile int kvend = 0;    // marks beginning of *collectable*
    private int kvindex = 0;           // marks end of *collected*
    private final int[] kvoffsets;     // indices into kvindices
    private final int[] kvindices;     // partition, k/v offsets into kvbuffer
    private volatile int bufstart = 0; // marks beginning of *spill*
    private volatile int bufend = 0;   // marks beginning of *collectable*
    private volatile int bufvoid = 0;  // marks the point where we should stop
                                       // reading at the end of the buffer
    private int bufindex = 0;          // marks end of *collected*
    private int bufmark = 0;           // marks end of *record*
    private byte[] kvbuffer;           // main output buffer




What do the terms "spill", "collectable", and "collected" mean?

I guess that, because map outputs continue to be written to the buffer while
a spill takes place, there must be at least two pointers: one marking where
new map outputs are written and one marking up to where data is spilled. But
I don't know exactly what "spill", "collectable", and "collected" mean.
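
Here is my guess at how the three "kv" pointers could cooperate on a circular
array, written as a toy sketch so that it can be corrected. It only reflects
my reading of the comments: the class ToySpillPointers and its methods are
hypothetical, the array holds plain ints instead of record offsets, and there
is no threading, so it is not how "MapTask.java" actually does it.

```java
// Toy circular buffer with start/end/index pointers; NOT the real Hadoop code.
class ToySpillPointers {
    final int[] slots;
    int kvstart = 0;  // beginning of the region being spilled
    int kvend = 0;    // end of the spill region = beginning of collectable space
    int kvindex = 0;  // next free slot = end of what has been collected

    ToySpillPointers(int capacity) {
        slots = new int[capacity];
    }

    // In this model a spill is running whenever the spill region is non-empty.
    boolean spillInProgress() {
        return kvstart != kvend;
    }

    // Write one record; refuse if the write would run into unspilled data.
    boolean collect(int record) {
        int next = (kvindex + 1) % slots.length;
        if (next == kvstart) {
            return false;  // full: the caller would have to wait for a spill
        }
        slots[kvindex] = record;
        kvindex = next;
        return true;
    }

    // Freeze everything collected so far as the region [kvstart, kvend).
    void startSpill() {
        kvend = kvindex;
    }

    // "Write out" the frozen region, then release it for new records.
    int finishSpill() {
        int spilled = 0;
        for (int i = kvstart; i != kvend; i = (i + 1) % slots.length) {
            spilled++;  // a real spill would sort and write slots[i] to disk
        }
        kvstart = kvend;
        return spilled;
    }

    public static void main(String[] args) {
        ToySpillPointers p = new ToySpillPointers(4);
        p.collect(1);
        p.collect(2);
        p.startSpill();                  // records 1 and 2 are now being spilled
        p.collect(3);                    // collection continues during the spill
        System.out.println("spilled " + p.finishSpill() + " records");
    }
}
```

In this reading, "spill" is the region from kvstart to kvend, "collectable" is
the free space that starts at kvend, and "collected" is everything written up
to kvindex. I assume the "buf" pointers play the same roles for "kvbuffer".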



Is it efficient to partition data first and then sort records inside each
partition?

Is this done to avoid expensive pairwise key comparisons?
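
What I have in mind is something like the toy comparison below, where the
cheap integer comparison on the partition number is tried first and the keys
are compared only when two records fall in the same partition. This is only a
sketch of my question: ToyPartitionSort and sortedOrder() are made-up names,
and keys are plain Strings rather than serialized bytes.

```java
import java.util.Arrays;

// Toy "partition first, then key" ordering; NOT the real Hadoop comparator.
class ToyPartitionSort {
    // Returns record indices ordered by partition, then by key within a partition.
    static Integer[] sortedOrder(int[] partitions, String[] keys) {
        Integer[] order = new Integer[keys.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (a, b) -> {
            // Cheap check first: different partitions never compare keys.
            if (partitions[a] != partitions[b]) {
                return Integer.compare(partitions[a], partitions[b]);
            }
            // Same partition: fall back to the (possibly expensive) key comparison.
            return keys[a].compareTo(keys[b]);
        });
        return order;
    }

    public static void main(String[] args) {
        int[] partitions = {1, 0, 1, 0};
        String[] keys = {"b", "z", "a", "c"};
        System.out.println(Arrays.toString(sortedOrder(partitions, keys)));
    }
}
```

Only the index array is reordered here, so the records of one partition end
up contiguous and sorted by key without moving the records themselves.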



Are there any documents containing explanations about how such internal
classes are implemented? 






