hadoop-mapreduce-issues mailing list archives

From "Hong Tang (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-64) Map-side sort is hampered by io.sort.record.percent
Date Tue, 15 Dec 2009 05:35:18 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-64?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790569#action_12790569

Hong Tang commented on MAPREDUCE-64:

The design is quite clever and elegant. I like it. The code is clean, but a bit tricky to
understand (more on this later, with some suggestions on refactoring).

- MapOutputBuffer.collect: The logic for calculating the equator seems to be missing a multiplication
by METASIZE. It should be:
               final int newPos = (bufindex +
                 Math.max(2 * METASIZE - 1,
                         Math.min(distkvi / 2, distkvi / (METASIZE + avgRec) * METASIZE)))
- Buffer.write(byte[], int, int): "blockwrite = distkvi < distkve" should be "blockwrite
= distkvi <= distkve"
- A potential inefficiency: if we encounter a large record when there are a few (but not zero)
records in the buffer, those few records get written out as a single spill. A better way is
to spill the single large record and continue accumulating records after that. This should
be a very rare corner case, so it may not need to be addressed in this jira; it would be nice
to mark it with a TODO in the comments.
- Any particular reason to shut down the thread in Buffer.flush() rather than Buffer.close()?
- In SpillThread: " if (bufend < bufindex && bufindex < bufstart)" should probably
be " if (bufend < bufstart) {"
- In TestMapCollection: a uniform random distribution is used to determine how many bytes to
write in serialization, and to determine key/value sizes for RandomFactory. This is less desirable
in the sense that very small values are not sufficiently tested. Suggest changing to a distribution
that gives more weight to small values, e.g. (min + Math.exp(random.nextDouble()*Math.log(max-min))).
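The suggested expression can be sketched as follows (the class and method names here are illustrative, not from the patch). Because the logarithm of the sampled value is uniform, small sizes come up far more often than under a uniform distribution:

```java
import java.util.Random;

public class SkewedSizes {
    // Returns a size in roughly (min, max), with small values heavily
    // favored: exp(u * ln(max - min)) for uniform u in [0, 1) is
    // log-uniform over [1, max - min).
    static int sampleSize(Random random, int min, int max) {
        return (int) (min + Math.exp(random.nextDouble() * Math.log(max - min)));
    }
}
```

For min=1, max=1024, roughly half of the samples land at or below sqrt(1023) ~= 32, whereas a uniform distribution would put only ~3% of samples there.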

I also have a couple of suggestions on refactoring the code to make it more readable:
- Separate the set of variables used by the main thread for writing from the set of variables
used by the spill thread for spilling. (Currently kvend and bufend are used in two different
contexts: when there is a spill active and when there is not.)
- Related to the above, add a variable called spillExists to describe the state when there
is a spill buffer. The lifetime of spillExists==TRUE covers that of spillInProgress==TRUE.
- Suggest changing the direct (idx+offset) based access to kvmeta to method calls.
- Suggest refactoring the logic for marking a spill region.

Other very minor nits:
- MapOutputBuffer.collect: it would be nice to spell out the invariant that there are always
METASIZE bytes available beyond kvindex.
- MapOutputBuffer: document the use of "bufferRemaining" as a hint for whether we *may* need
to block and spill. If bufferRemaining>=0, there is guaranteed space for us to continue writing.
- BlockBuffer is only usable inside MapOutputBuffer; suggest removing the constructor BlockBuffer(OutputStream).
- Suggest renaming BlockBuffer.reset() to BlockBuffer.shiftKeyBuffer().
- Suggest adding a note to Buffer.write(byte[], int, int) that the check of bufferRemaining
should not be bypassed even if len==0.
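For readers less familiar with the patch, the "bufferRemaining as a hint" pattern above can be sketched like this (a minimal illustration, not the patch code; the class name, budget size, and slow-path body are made up). The hot path does only an unsynchronized decrement and comparison; the locked slow path is entered only when the hint says we *may* need to block or spill:

```java
public class HintedBuffer {
    private volatile int bufferRemaining = 1 << 16; // soft budget, illustrative value

    // Returns true if the caller took the locked slow path.
    boolean write(int len) {
        // Fast path: a single decrement and check, no lock. Note the check
        // runs even when len == 0, matching the nit above.
        bufferRemaining -= len;
        if (bufferRemaining >= 0) {
            return false; // guaranteed space; continue writing
        }
        synchronized (this) {
            // Slow path: recompute exact free space, possibly start a spill
            // or block until the spill thread frees room (all elided here;
            // we just pretend a spill restored the budget).
            bufferRemaining = 1 << 16;
            return true;
        }
    }
}
```

The point of the pattern is that the common case pays only one subtraction, while correctness is still settled under the lock whenever the conservative hint trips.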

> Map-side sort is hampered by io.sort.record.percent
> ---------------------------------------------------
>                 Key: MAPREDUCE-64
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-64
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: Arun C Murthy
>            Assignee: Chris Douglas
>         Attachments: M64-0.patch, M64-0i.png, M64-1.patch, M64-1i.png, M64-2.patch, M64-2i.png,
M64-3.patch, M64-4.patch
> Currently io.sort.record.percent is a fairly obscure, per-job configurable, expert-level
parameter which controls how much accounting space is available for records in the map-side
sort buffer (io.sort.mb). Typically values for io.sort.mb (100) and io.sort.record.percent
(0.05) imply that we can store ~350,000 records in the buffer before necessitating a sort/combine/spill.
> However, for many applications which deal with small records, e.g. the world-famous wordcount
and its family, this implies we can only use 5-10% of io.sort.mb, i.e. 5-10M, before we spill,
in spite of having _much_ more memory available in the sort-buffer. Wordcount, for example,
results in ~12 spills (given an hdfs block size of 64M). The presence of a combiner exacerbates
the problem by piling on serialization/deserialization of records too...
> Sure, jobs can configure io.sort.record.percent, but it's tedious and obscure; we really
can do better by having the framework automagically pick it, using all available memory
(up to io.sort.mb) for either the data or the accounting.
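The "~350,000 records" figure above can be sanity-checked. Assuming the pre-patch accounting layout costs 16 bytes of metadata per record (an assumption; the issue text does not state the per-record cost), 5% of a 100 MB buffer gives:

```java
public class Accounting {
    // Records that fit in the accounting region: (buffer * percent) / cost.
    static long maxRecords(long sortBufferBytes, double recordPercent, int bytesPerRecord) {
        return (long) (sortBufferBytes * recordPercent) / bytesPerRecord;
    }
}
```

With sortBufferBytes = 100 MB, recordPercent = 0.05, and 16 bytes per record, this yields 327,680, consistent with the "~350,000" estimate in the description.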

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
