hadoop-mapreduce-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-64) Map-side sort is hampered by io.sort.record.percent
Date Tue, 20 Oct 2009 05:01:59 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-64?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12767693#action_12767693 ]

Todd Lipcon commented on MAPREDUCE-64:

Thanks for those great diagrams - they really helped me understand things much better! A picture
is worth 1000 lines of code or something :)

I applied your patch just now and ran it through clover for coverage analysis. Here are a
couple things I think we should cover before committing:

- We don't currently run any tests with job.getCompressMapOutput returning true. This caused
an issue or two with the shuffle in the past, so we should get at least one test that uses
a codec.
- Since we're using the Local Runner for these tests, it's all a single partition. This is
probably OK, since I imagine other tests throughout Hadoop exercise those paths (I'm only
looking at coverage from TestMapCollection here)
- Line 1097 ("if (bufindex + headbytelen < avail) {" in void reset()) is always true in
our tests. We should get a test case to exercise the other half of this branch.
- Line 1365 (kvstart >= kvend ternary in sortAndSpill) is always true. Should exercise
the other half of that.

On a code level, one more thing I noticed - can you put in a small comment describing the
synchronization policy for the various offsets? Those used to be volatile and now they're
under a lock, so it would be good to note that in the code.

I'll try to get a chance to run some basic benchmarks later this week.

> Map-side sort is hampered by io.sort.record.percent
> ---------------------------------------------------
>                 Key: MAPREDUCE-64
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-64
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: Arun C Murthy
>            Assignee: Chris Douglas
>         Attachments: M64-0.patch, M64-0i.png, M64-1.patch, M64-1i.png, M64-2.patch, M64-2i.png,
> Currently io.sort.record.percent is a fairly obscure, per-job configurable, expert-level
> parameter which controls how much accounting space is available for records in the map-side
> sort buffer (io.sort.mb). Typical values for io.sort.mb (100) and io.sort.record.percent
> (0.05) imply that we can store ~350,000 records in the buffer before necessitating a sort/combine/spill.
> However, for many applications which deal with small records, e.g. the world-famous wordcount
> and its family, this implies we can only use 5-10% of io.sort.mb, i.e. 5-10M, before we spill
> in spite of having _much_ more memory available in the sort buffer. Wordcount, for example,
> results in ~12 spills (given an HDFS block size of 64M). The presence of a combiner exacerbates
> the problem by piling on serialization/deserialization of records too...
> Sure, jobs can configure io.sort.record.percent, but it's tedious and obscure; we really
> can do better by getting the framework to automagically pick it by using all available memory
> (up to io.sort.mb) for either the data or accounting.
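
The arithmetic behind the figures in the description can be sketched roughly as follows. This is an illustration only, not code from the patch: the class and method names are hypothetical, and the 16 bytes of accounting space per record is an assumption about the pre-patch buffer layout that this message does not state. The sketch also ignores io.sort.spill.percent, which would trigger the spill somewhat earlier.

```java
public class SortBufferEstimate {
    // Assumption (not stated in this message): each record costs 16 bytes
    // of accounting space in the pre-patch sort buffer.
    static final int ACCT_BYTES_PER_RECORD = 16;

    // Number of records that fit in the accounting share of the buffer.
    static long maxRecords(int ioSortMb, double recordPercent) {
        long bufferBytes = (long) ioSortMb * 1024 * 1024;
        long acctBytes = (long) (bufferBytes * recordPercent);
        return acctBytes / ACCT_BYTES_PER_RECORD;
    }

    // Bytes of serialized data buffered when the first limit is hit:
    // either the accounting space or the data space runs out.
    static long dataBytesBeforeSpill(int ioSortMb, double recordPercent,
                                     int avgRecordBytes) {
        long bufferBytes = (long) ioSortMb * 1024 * 1024;
        long acctBytes = (long) (bufferBytes * recordPercent);
        long dataCapacity = bufferBytes - acctBytes;
        long recordLimited = maxRecords(ioSortMb, recordPercent) * avgRecordBytes;
        return Math.min(dataCapacity, recordLimited);
    }

    public static void main(String[] args) {
        int ioSortMb = 100;
        double recordPercent = 0.05;
        System.out.println("max records: " + maxRecords(ioSortMb, recordPercent));
        // Small records (e.g. wordcount-style) versus large records.
        for (int recordBytes : new int[] {16, 32, 1000}) {
            long used = dataBytesBeforeSpill(ioSortMb, recordPercent, recordBytes);
            System.out.println(recordBytes + "-byte records: "
                + used / (1024 * 1024) + "M of data buffered before a spill");
        }
    }
}
```

Under these assumptions the defaults allow 327,680 records, the same order as the ~350,000 quoted above, and records of 16 to 32 bytes fill only 5M to 10M of data space before the accounting space runs out.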

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
