hadoop-common-dev mailing list archives

From "Owen O'Malley (JIRA)" <j...@apache.org>
Subject [jira] Resolved: (HADOOP-2054) Improve memory model for map-side sorts
Date Mon, 31 Mar 2008 22:54:24 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Owen O'Malley resolved HADOOP-2054.

       Resolution: Duplicate
    Fix Version/s: 0.17.0
         Assignee: Chris Douglas  (was: Arun C Murthy)

This was fixed by HADOOP-2919.

> Improve memory model for map-side sorts
> ---------------------------------------
>                 Key: HADOOP-2054
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2054
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Arun C Murthy
>            Assignee: Chris Douglas
>             Fix For: 0.17.0
> {{MapTask#MapOutputBuffer}} uses a plain {{DataOutputBuffer}}, which defaults to
a 32-byte buffer, and the {{DataOutputBuffer#write}} call doubles the underlying
byte array whenever it needs more space.
> However, for maps that output any decent amount of data (e.g. 128MB in examples/Sort.java),
this means the buffer grows painfully slowly from 2^6 to 2^28, and each step creates
a new array, followed by an array copy:
> {noformat}
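>     public void write(DataInput in, int len) throws IOException {
>       int newcount = count + len;
>       if (newcount > buf.length) {
>         // Grow to at least double the current capacity, then copy the old contents over.
>         byte newbuf[] = new byte[Math.max(buf.length << 1, newcount)];
>         System.arraycopy(buf, 0, newbuf, 0, count);
>         buf = newbuf;
>       }
>       // Append len bytes from the input directly into the (possibly new) buffer.
>       in.readFully(buf, count, len);
>       count = newcount;
>     }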
> {noformat}
> I reckon we could do much better in the {{MapTask}}, specifically... 
> For example, we could start with a buffer of size 1/4KB and quadruple, rather than double, up to,
say, 4/8/16MB. Then we resume doubling (or less).
> This way the buffer ramps up quickly at first, minimizing the number of {{System.arraycopy}} calls and
the small buffers left for GC; later it switches to doubling so that it doesn't ramp up too quickly,
minimizing memory wastage due to fragmentation.
> Of course, this issue is about benchmarking and figuring out whether all this is worth it and,
if so, what the right set of trade-offs is.
> Thoughts?
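
For illustration only, the growth policy proposed in the quoted issue could be sketched roughly as below and compared against plain doubling by counting reallocations. This is not code from Hadoop or from the HADOOP-2919 fix; the class name {{GrowthPolicySketch}}, its methods, and the concrete sizes (a 1KB start and a 16MB switch-over point) are assumptions picked from the ranges mentioned in the issue.

```java
// Hypothetical sketch: names and constants are illustrative, not Hadoop APIs.
final class GrowthPolicySketch {

    /** Next capacity under plain doubling, mirroring the quoted write() method. */
    static long nextDoubling(long capacity, long needed) {
        return Math.max(capacity << 1, needed);
    }

    /** Next capacity: quadruple while below the threshold, double afterwards. */
    static long nextHybrid(long capacity, long needed, long threshold) {
        long grown = capacity < threshold ? capacity << 2 : capacity << 1;
        return Math.max(grown, needed);
    }

    /** Reallocations needed to reach target bytes with plain doubling. */
    static int countDoublingResizes(long initial, long target) {
        int resizes = 0;
        for (long cap = initial; cap < target; cap = nextDoubling(cap, cap + 1)) {
            resizes++;
        }
        return resizes;
    }

    /** Reallocations needed to reach target bytes with the hybrid policy. */
    static int countHybridResizes(long initial, long target, long threshold) {
        int resizes = 0;
        for (long cap = initial; cap < target; cap = nextHybrid(cap, cap + 1, threshold)) {
            resizes++;
        }
        return resizes;
    }

    public static void main(String[] args) {
        long target = 128L << 20;      // 128MB of map output, as in the issue
        long threshold = 16L << 20;    // assumed switch-over point (16MB)
        System.out.println("doubling from 32B: "
                + countDoublingResizes(32, target) + " resizes");
        System.out.println("hybrid from 1KB:   "
                + countHybridResizes(1024, target, threshold) + " resizes");
    }
}
```

Under these assumed numbers, plain doubling from 32 bytes reallocates 22 times before reaching 128MB, while the hybrid policy starting at 1KB reallocates 10 times. The count alone says nothing about GC pressure or fragmentation, which is why the issue asks for benchmarking.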

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
