hadoop-mapreduce-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-4755) Rewrite MapOutputBuffer to use direct buffers & allow parallel sort+collect
Date Mon, 29 Oct 2012 15:58:12 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13486098#comment-13486098 ]

Todd Lipcon commented on MAPREDUCE-4755:
----------------------------------------

Hi Gopal,

I haven't looked at the patch yet, but how do you deal with compressing the spill files? Also,
how much "handling your own buffers" do you anticipate obviating? We obviously still need
to have bounded memory usage -- just using many GBs and letting the OS page stuff out is a
recipe for swap usage and the machine grinding to a halt.

I agree that we could save CPU using direct buffers (both by avoiding copies and by using
the more efficient CRC code), but I'm not sold on the mmap part. The other improvement you
should look into if you're interested in improving the sort process would be to do the sort
on L2-sized chunks and then merge at spill time. Right now our sort is horribly cache-inefficient.
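
To make the chunk-then-merge idea concrete, here is a minimal sketch. It is not
code from the patch or from MapOutputBuffer: the class name, the method name and
the use of plain int keys are all assumptions made for illustration, and in
practice chunkSize would be picked so that one chunk fits in the L2 cache. Each
chunk is sorted on its own, and the sorted runs are then combined with a k-way
merge, which is the step that would happen at spill time.

```java
import java.util.Arrays;
import java.util.PriorityQueue;

// Hypothetical illustration only; not part of Hadoop.
final class ChunkedSortSketch {

    /** Sorts fixed-size chunks independently, then k-way merges the sorted runs. */
    static int[] sortInChunksThenMerge(int[] data, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        int[] records = data.clone();

        // Heap of runs, each stored as {nextIndex, endExclusive}, ordered by
        // the record currently at the head of the run.
        PriorityQueue<int[]> runs = new PriorityQueue<>(
                (a, b) -> Integer.compare(records[a[0]], records[b[0]]));

        // Phase 1: sort each chunk on its own so the working set stays small.
        for (int start = 0; start < records.length; start += chunkSize) {
            int end = Math.min(start + chunkSize, records.length);
            Arrays.sort(records, start, end);
            runs.add(new int[] {start, end});
        }

        // Phase 2: repeatedly take the smallest head among all runs.
        int[] merged = new int[records.length];
        int out = 0;
        while (!runs.isEmpty()) {
            int[] run = runs.poll();
            merged[out++] = records[run[0]++];
            if (run[0] < run[1]) {
                runs.add(run); // run still has records, put it back
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        int[] sorted = sortInChunksThenMerge(new int[] {5, 3, 9, 1, 7, 2, 8}, 3);
        System.out.println(Arrays.toString(sorted));
    }
}
```

The merge phase reads each run sequentially, so only the sort phase does random
access, and that access is confined to one chunk at a time.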
                
> Rewrite MapOutputBuffer to use direct buffers & allow parallel sort+collect
> ---------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4755
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4755
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 3.0.0
>         Environment: Ubuntu 12.10 x86_64 (Bulldozer 8-core)
>            Reporter: Gopal V
>            Assignee: Gopal V
>              Labels: optimization, sort
>
> The MapOutputBuffer has been written with a very severe constraint on the amount of memory
> it can consume. This results in code that has to page-in & page-out (i.e. spill) data as
> it passes through the map buffers.
> With the advent of the java.nio package, there is a fast and portable MMap alternative
> to handling your own buffers. This exists outside the GC space of Java and yet provides decently
> fast memory access to all the data.
> The suggestion is that using mmap() direct buffers can be faster when a spill is involved
> and simpler than the current spill logic when given enough address space, and that it uses the buffer
> caches to deliver best-effort I/O.
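
For readers unfamiliar with the java.nio facility the description refers to, the
following is a minimal sketch of a memory-mapped buffer. It is not taken from the
patch: the class name, the method name and the temp-file prefix are hypothetical.
It maps a temporary file with FileChannel.map, writes bytes through the mapping,
and reads them back. The mapped region lives outside the Java heap and is backed
by the OS page cache, which is also why the bounded-memory concern raised in the
comment above still applies.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration only; not part of Hadoop.
final class MappedBufferSketch {

    /** Writes payload through a memory-mapped temp file and reads it back. */
    static byte[] roundTrip(byte[] payload) {
        try {
            Path file = Files.createTempFile("spill-sketch", ".bin");
            file.toFile().deleteOnExit();
            try (FileChannel channel = FileChannel.open(
                    file, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
                // A READ_WRITE mapping grows the file to the mapped size.
                MappedByteBuffer buffer =
                        channel.map(FileChannel.MapMode.READ_WRITE, 0, payload.length);
                buffer.put(payload); // lands in the page cache, not the Java heap
                buffer.force();      // ask the OS to write dirty pages to storage
                buffer.flip();       // rewind so the same bytes can be read back
                byte[] copy = new byte[payload.length];
                buffer.get(copy);
                return copy;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] back = roundTrip(new byte[] {1, 2, 3, 4});
        System.out.println(back.length + " bytes read back");
    }
}
```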

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
