hadoop-mapreduce-issues mailing list archives

From "Gopal V (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-4755) Rewrite MapOutputBuffer to use direct buffers & allow parallel sort+collect
Date Tue, 30 Oct 2012 08:56:12 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13486748#comment-13486748 ]

Gopal V commented on MAPREDUCE-4755:

Quicksorting a 64 MB chunk off an mmap() turned out to be slow, mostly because of the overhead
of the ByteBuffers (i.e. DataInputBuffer needs a ByteBuffer version as well). Quicksort is great
with paging, because it does mostly linear scans on the partitions.
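
To illustrate the access pattern, here is a minimal sketch of an in-place quicksort over a
ByteBuffer. It is not the patch code: the class name, the one-int-per-record layout and the
Hoare-style partition are assumptions made for the example. Every key read and swap goes
through absolute getInt/putInt calls, which is where the ByteBuffer overhead shows up, while
the two scan indices only ever move linearly towards each other.

```java
import java.nio.ByteBuffer;

final class ByteBufferQuickSort {
    // Assumption for this sketch: each record is a single 4-byte int key.
    static final int RECORD_BYTES = Integer.BYTES;

    static int key(ByteBuffer buf, int record) {
        return buf.getInt(record * RECORD_BYTES);
    }

    static void swap(ByteBuffer buf, int a, int b) {
        int tmp = key(buf, a);
        buf.putInt(a * RECORD_BYTES, key(buf, b));
        buf.putInt(b * RECORD_BYTES, tmp);
    }

    // Sorts records lo..hi (both inclusive) in ascending key order.
    static void sort(ByteBuffer buf, int lo, int hi) {
        while (lo < hi) {
            int pivot = key(buf, lo + (hi - lo) / 2);
            int i = lo;
            int j = hi;
            // Partition: two linear scans moving towards each other.
            while (i <= j) {
                while (key(buf, i) < pivot) i++;
                while (key(buf, j) > pivot) j--;
                if (i <= j) {
                    swap(buf, i, j);
                    i++;
                    j--;
                }
            }
            // Recurse into the smaller side, loop on the larger one
            // to keep the stack depth logarithmic.
            if (j - lo < hi - i) {
                sort(buf, lo, j);
                lo = i;
            } else {
                sort(buf, i, hi);
                hi = j;
            }
        }
    }
}
```

The same code works on a heap buffer, a direct buffer or a MappedByteBuffer, since it only
uses the absolute accessors of ByteBuffer.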

Actually, MAPREDUCE-3235 does some of the things I discovered on my own. The first is the
kvmeta swapper, which seemed to get no benefit out of swapping single ints (KVINDEX): the
indirection was a complete waste of CPU and cache lines. I used the INDEX space to store
the vallen in the buffer instead.
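
A rough sketch of what swapping whole metadata records (rather than single index ints) could
look like. The field order, the constant names and the helper class are hypothetical and are
not taken from the patch or from MAPREDUCE-3235; the only point is that the value length
lives in the record itself and the sort moves the full record.

```java
import java.nio.ByteBuffer;
import java.nio.IntBuffer;

final class InlineMetaSketch {
    // Hypothetical field order; the real kvmeta layout may differ.
    static final int PARTITION = 0;
    static final int KEYSTART = 1;
    static final int VALSTART = 2;
    static final int VALLEN = 3;  // stored inline instead of an index slot
    static final int NMETA = 4;   // ints per record, i.e. 16 bytes per record

    static IntBuffer allocate(int records) {
        return ByteBuffer.allocateDirect(records * NMETA * Integer.BYTES).asIntBuffer();
    }

    static void put(IntBuffer meta, int rec, int partition,
                    int keyStart, int valStart, int valLen) {
        int base = rec * NMETA;
        meta.put(base + PARTITION, partition);
        meta.put(base + KEYSTART, keyStart);
        meta.put(base + VALSTART, valStart);
        meta.put(base + VALLEN, valLen);
    }

    static int get(IntBuffer meta, int rec, int field) {
        return meta.get(rec * NMETA + field);
    }

    // Swaps two complete records; there is no separate index array to chase.
    static void swap(IntBuffer meta, int a, int b) {
        int baseA = a * NMETA;
        int baseB = b * NMETA;
        for (int f = 0; f < NMETA; f++) {
            int tmp = meta.get(baseA + f);
            meta.put(baseA + f, meta.get(baseB + f));
            meta.put(baseB + f, tmp);
        }
    }
}
```

With four ints per record, a record is 16 bytes, so four of them fit in one 64-byte cache line.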

Perhaps I'm being biased by my own hardware here, but since the cache lines are going to be
64 bytes wide (at least on any serious hardware), there was no real benefit from making the
data only 32 bytes wide while sorting anyway.

I think I need to redesign this into an async pipeline. I *really* hope that the popular Comparator
implementations are thread-safe.
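
One possible shape for such a pipeline, as a sketch only: full chunks are handed to a thread
pool and sorted in the background while the caller keeps going. The class, the method and
the chunk representation (plain lists) are made up for the example. The sketch sidesteps the
thread-safety question by asking a Supplier for a fresh Comparator per sort task, which is an
assumption about how comparators could be obtained, not something the current code does.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Supplier;

final class AsyncSortSketch {
    // Sorts each chunk on a pool thread and returns the sorted copies
    // in the same order as the input chunks.
    static <T> List<List<T>> sortChunks(List<List<T>> chunks,
                                        Supplier<Comparator<T>> comparators,
                                        int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<T>>> pending = new ArrayList<>();
            for (List<T> chunk : chunks) {
                Callable<List<T>> task = () -> {
                    List<T> copy = new ArrayList<>(chunk);
                    // One comparator instance per task, so no instance is shared
                    // between threads.
                    copy.sort(comparators.get());
                    return copy;
                };
                pending.add(pool.submit(task));
            }
            List<List<T>> sorted = new ArrayList<>();
            for (Future<List<T>> f : pending) {
                sorted.add(f.get());
            }
            return sorted;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

If a comparator is known to be stateless, the Supplier can simply return the same instance
every time.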

> Rewrite MapOutputBuffer to use direct buffers & allow parallel sort+collect
> ---------------------------------------------------------------------------
>                 Key: MAPREDUCE-4755
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4755
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 3.0.0
>         Environment: Ubuntu 12.10 x86_64 (Bulldozer 8-core)
>            Reporter: Gopal V
>            Assignee: Gopal V
>              Labels: optimization, sort
> The MapOutputBuffer has been written with a very severe constraint on the amount of memory
> it can consume. This results in code that has to page-in & page-out (i.e. spill) data as
> it passes through the map buffers.
> With the advent of the java.nio package, there is a fast and portable MMap alternative
> to handling your own buffers. This exists outside the GC space of Java and yet provides decently
> fast memory access to all the data.
> The suggestion is that using mmap() direct buffers can be faster when a spill is involved
> and simpler than the current spill logic when given enough address space, and that it uses the buffer
> cache to deliver best-effort I/O.
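
For reference, a minimal sketch of obtaining an mmap()-backed buffer through java.nio. The
class and method names and the use of a temporary scratch file are assumptions for the
example; FileChannel.map and MappedByteBuffer are the standard java.nio API. The mapping
lives outside the Java heap, and it stays valid after the channel is closed.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MappedScratch {
    // Maps the first `bytes` bytes of `file` read-write, growing the file if needed.
    static MappedByteBuffer map(Path file, int bytes) {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            return ch.map(FileChannel.MapMode.READ_WRITE, 0, bytes);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Convenience for experiments: maps a fresh temporary file.
    static MappedByteBuffer mapTemp(int bytes) {
        try {
            Path tmp = Files.createTempFile("mapbuf", ".bin"); // hypothetical scratch file
            tmp.toFile().deleteOnExit();
            return map(tmp, bytes);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Writes to the returned buffer go to the page cache and are flushed by the OS on its own
schedule; MappedByteBuffer.force() can be called when the data has to reach the disk.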

