hadoop-mapreduce-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-4755) Rewrite MapOutputBuffer to use direct buffers & allow parallel sort+collect
Date Mon, 29 Oct 2012 20:06:12 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13486304#comment-13486304 ]

Todd Lipcon commented on MAPREDUCE-4755:

bq. This is rather rough at the moment. Since I'm spilling via mmap(), I can't compress the spill till I am done sorting it.
bq. Considering my experiment was to only sort the kvmeta here and not the kvbuffer, compression would cause significant trouble because I assume I can seek & read fast within the MappedByteBuffer.

Right -- we definitely don't want to be quicksorting a file which is mmapped.

bq. I did look at sorting L2 sized chunks - which raises an interesting question, the comparator is responsible for blowing off the cache, the actual data (in say, a terasort) is actually not staying in cache during the loops

Right, the trick would be to actually sort the _data_ in L2 sized chunks, not just do the
indirection that we do now. For example:

- While data arrives:
-- Accumulate 2MB of data and associated indexes
-- Indirect-sort this data, which fits inside cache
-- 'Spill to RAM' -- actually rearrange the data to be in sorted order, and drop the indexes
-- If RAM is full:
--- merge the sorted segments in RAM to disk
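
The steps above can be sketched roughly as follows. This is a minimal illustration only, not the MapOutputBuffer code and not the implementation in the facebook branch. The class name ChunkedSortSketch, the fixed-width records (the whole record is treated as the key), and the in-memory list returned by mergeRuns() in place of a real disk spill are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch of cache-sized chunk sorting; not Hadoop code. */
public class ChunkedSortSketch {
    private final int recordLen;        // assumed fixed record width, in bytes
    private final byte[] chunk;         // current unsorted chunk (about 2MB in the comment above)
    private int used = 0;               // bytes of 'chunk' currently filled
    private final List<byte[]> sortedRuns = new ArrayList<>(); // segments "spilled to RAM"

    /** Assumes chunkBytes >= recordLen. */
    public ChunkedSortSketch(int recordLen, int chunkBytes) {
        this.recordLen = recordLen;
        this.chunk = new byte[(chunkBytes / recordLen) * recordLen];
    }

    /** Accumulate one record; sort and spill to RAM once the chunk is full. */
    public void collect(byte[] record) {
        System.arraycopy(record, 0, chunk, used, recordLen);
        used += recordLen;
        if (used == chunk.length) {
            spillToRam();
        }
    }

    /** Indirect-sort the chunk, then physically rearrange the data and drop the index. */
    void spillToRam() {
        if (used == 0) {
            return;
        }
        int n = used / recordLen;
        Integer[] index = new Integer[n];            // record offsets into 'chunk'
        for (int i = 0; i < n; i++) {
            index[i] = i * recordLen;
        }
        Arrays.sort(index, (a, b) -> compare(chunk, a, chunk, b));
        byte[] run = new byte[used];                 // data copied out in sorted order
        for (int i = 0; i < n; i++) {
            System.arraycopy(chunk, index[i], run, i * recordLen, recordLen);
        }
        sortedRuns.add(run);                         // the index array is garbage from here on
        used = 0;
    }

    /** Unsigned byte-wise comparison of two fixed-width records. */
    private int compare(byte[] x, int xo, byte[] y, int yo) {
        for (int i = 0; i < recordLen; i++) {
            int d = (x[xo + i] & 0xff) - (y[yo + i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return 0;
    }

    /** K-way merge of the sorted runs; a real spill would stream this to disk. */
    public List<byte[]> mergeRuns() {
        spillToRam();                                // flush any partially filled chunk
        // Heap entries are {run number, byte offset of that run's next record}.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            (a, b) -> compare(sortedRuns.get(a[0]), a[1], sortedRuns.get(b[0]), b[1]));
        for (int r = 0; r < sortedRuns.size(); r++) {
            heap.add(new int[] {r, 0});
        }
        List<byte[]> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] head = heap.poll();
            byte[] run = sortedRuns.get(head[0]);
            out.add(Arrays.copyOfRange(run, head[1], head[1] + recordLen));
            head[1] += recordLen;
            if (head[1] < run.length) {
                heap.add(head);
            }
        }
        sortedRuns.clear();
        return out;
    }
}
```

The comparator only ever touches one chunk during the sort, and the merge reads each run sequentially, which is the cache-friendliness argument being made here. A caller would decide when "RAM is full" and invoke mergeRuns() at that point.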

There is actually an implementation of this in the Facebook branch on GitHub, if you want
to take a look. I had also done some prototyping around improving cache efficiency in
https://issues.apache.org/jira/browse/MAPREDUCE-3235

> Rewrite MapOutputBuffer to use direct buffers & allow parallel sort+collect
> ---------------------------------------------------------------------------
>                 Key: MAPREDUCE-4755
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4755
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 3.0.0
>         Environment: Ubuntu 12.10 x86_64 (Bulldozer 8-core)
>            Reporter: Gopal V
>            Assignee: Gopal V
>              Labels: optimization, sort
> The MapOutputBuffer has been written with a very severe constraint on the amount of memory
> it can consume. This results in code that has to page-in & page-out (i.e. spill) data as
> it passes through the map buffers.
> With the advent of the java.nio package, there is a fast and portable MMap alternative
> to handling your own buffers. This exists outside the GC space of Java and yet provides decently
> fast memory access to all the data.
> The suggestion is that using mmap() direct buffers can be faster when a spill is involved
> and simpler than the current spill logic when given enough address space & uses the buffer
> caches to deliver best effort I/O.
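
For readers unfamiliar with the java.nio mapping the description refers to, a minimal sketch of writing to and reading from a memory-mapped file follows. It only uses FileChannel.map and MappedByteBuffer from the JDK; the class name MmapSpillSketch, the temp-file name, and the int payload are assumptions made for illustration and have nothing to do with the proposed patch.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical sketch of an mmap-backed buffer; not the MAPREDUCE-4755 patch. */
public class MmapSpillSketch {
    /** Writes the values into a mapped temp file and reads back the one at 'slot'. */
    static int writeAndReadBack(int[] values, int slot) {
        try {
            Path file = Files.createTempFile("spill-sketch", ".bin");
            file.toFile().deleteOnExit();
            try (FileChannel ch = FileChannel.open(
                    file, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
                // Mapping READ_WRITE past the current end of file grows the file.
                MappedByteBuffer buf = ch.map(
                    FileChannel.MapMode.READ_WRITE, 0, (long) values.length * Integer.BYTES);
                for (int v : values) {
                    buf.putInt(v);       // lands in the page cache, outside the Java heap
                }
                buf.force();             // ask the OS to write dirty pages to the file
                return buf.getInt(slot * Integer.BYTES); // absolute read, no explicit seek
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The absolute getInt at the end is the "seek & read fast" access pattern mentioned in the quoted comment; it is also why compressing the mapped region before sorting would get in the way.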
