hadoop-mapreduce-issues mailing list archives

From "Chris Douglas (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-2841) Task level native optimization
Date Mon, 29 Aug 2011 23:13:38 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13093291#comment-13093291

Chris Douglas commented on MAPREDUCE-2841:

bq. If the java impl uses a similar impl as the c++ one here, the only difference will be
language, right?

Yes, but the language difference includes other overheads (more below).

bq. Sorry, can you explain more about how the C++ version can do a better job here with a
predictable memory footprint? In the current Java impl, all records (no matter which reducer
they are going to) are stored in a central byte array. In the C++ impl, within one map task,
each reducer has a corresponding partition bucket which maintains its own memory buffer. From
what I understand, one partition bucket is for one reducer, and all records going to that
reducer from the current map task are stored there, then sorted and spilled from there.

Each partition bucket maintains its own memory buffer, so the memory consumed by the collection
framework includes the unused space in all the partition buffers. I'm calling that, possibly
imprecisely, internal fragmentation. The {{RawComparator}} interface also requires that keys
be contiguous, introducing other "waste" unless the partition's collection buffer is copied
whenever it is expanded (as in 0.16; that expansion/copying overhead also harms performance
and makes memory usage hard to predict, because both src and dst buffers exist simultaneously):
a key partially serialized at the end of a slab must be realigned into a new slab. This
happens only at the end of the circular buffer in the current implementation, but would happen
at the boundary of every partition collector chunk.
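For reference, the contiguity constraint comes from the raw-comparison signature (as in Hadoop's {{org.apache.hadoop.io.RawComparator}}); the {{realign}} helper below is a hypothetical illustration of the copy that a key spanning two slabs would force before it could be compared:

```java
// The raw-comparison signature (as in org.apache.hadoop.io.RawComparator):
// each key must be a single contiguous byte range b[s..s+l).
interface RawKeyComparator {
    int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);
}

class KeyRealign {
    // Hypothetical helper: a key serialized partly at the end of one slab
    // must be copied into one contiguous array before raw comparison can
    // see it as a single (offset, length) range.
    static byte[] realign(byte[] slab, int off, byte[] nextSlab, int tailLen) {
        int headLen = slab.length - off;               // bytes at the end of the old slab
        byte[] contiguous = new byte[headLen + tailLen];
        System.arraycopy(slab, off, contiguous, 0, headLen);
        System.arraycopy(nextSlab, 0, contiguous, headLen, tailLen);
        return contiguous;
    }
}
```

With per-partition chunks, this copy would be paid at every chunk boundary rather than only at the circular buffer's wrap point.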

That internal fragmentation creates unused buffer space that "prematurely" triggers a spill
to reclaim the memory. Allocating smaller slabs decreases internal fragmentation, but adds
roughly 8 bytes of object-header overhead per slab plus GC cycles. In contrast, a single large
allocation (like the current collection buffer) is made once and tenured, so it largely sits
outside the GC churn. The 4 byte overhead per record to track the partition is a space savings
over slabs exactly matching each record size, which would require at least 8 bytes per record
if naively implemented.
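To make that accounting concrete, here is an illustrative back-of-the-envelope sketch; the constants are assumptions for the example (typical JVM header size, the 4-byte partition id mentioned above), not measurements:

```java
// Illustrative space accounting for the three collection strategies discussed:
// per-record slabs, a single shared buffer, and fixed-size per-partition slabs.
class OverheadSketch {
    static final int OBJECT_HEADER_BYTES = 8;   // assumed typical JVM object header
    static final int PARTITION_META_BYTES = 4;  // per-record partition id in the shared buffer

    // Overhead if every record lives in its own exactly-sized slab object.
    static long perRecordSlabOverhead(long records) {
        return records * OBJECT_HEADER_BYTES;
    }

    // Overhead of the single shared collection buffer's per-record metadata.
    static long sharedBufferOverhead(long records) {
        return records * PARTITION_META_BYTES;
    }

    // Expected internal fragmentation with fixed-size slabs per partition:
    // on average, half of the last slab of each partition sits unused.
    static long expectedFragmentation(int partitions, int slabBytes) {
        return (long) partitions * slabBytes / 2;
    }
}
```

For a million records, the shared buffer's 4-byte ids cost ~4 MB versus ~8 MB of headers for per-record slabs; and with, say, 100 partitions on 1 MB slabs, ~50 MB of buffer space is expected to sit idle when the spill triggers.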

The current implementation is oriented toward stuffing the most records into a precisely fixed
amount of memory, and adopts a few assumptions: 1) one should spill as little as possible;
2) if spilling is required, at least don't block the mapper; 3) packing the most records into
each spill favors MapTasks with combiners. If there are cases (we all acknowledge that there
are) where spilling more often but _faster_ can compensate for that difference, then it's
worth reexamining those assumptions.
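A minimal sketch of the soft-limit behavior that assumption 2) implies; class and field names here are hypothetical (the real logic lives in MapTask's collection buffer, tuned by {{io.sort.mb}} and {{io.sort.spill.percent}}):

```java
// Hypothetical sketch of a soft-limit spill trigger: start a background spill
// once buffered bytes pass a threshold, and only block the mapper when the
// buffer is completely exhausted.
class SpillTrigger {
    final int bufferBytes;      // total collection buffer (io.sort.mb worth of space)
    final int softLimitBytes;   // io.sort.spill.percent of the buffer
    int usedBytes;
    boolean spillInProgress;

    SpillTrigger(int bufferBytes, double spillPercent) {
        this.bufferBytes = bufferBytes;
        this.softLimitBytes = (int) (bufferBytes * spillPercent);
    }

    // Returns true when a background spill should start (assumption 2:
    // spill early enough that the mapper rarely blocks on a full buffer).
    boolean onCollect(int recordBytes) {
        usedBytes += recordBytes;
        if (!spillInProgress && usedBytes >= softLimitBytes) {
            spillInProgress = true;
            return true; // kick off the spill thread
        }
        return false;
    }

    // The mapper must block only when the buffer is truly full.
    boolean mustBlock() {
        return usedBytes >= bufferBytes;
    }
}
```

Spilling "more often but faster" amounts to lowering the effective soft limit while making each spill cheap enough that the trade in assumption 1) still pays off.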

> Task level native optimization
> ------------------------------
>                 Key: MAPREDUCE-2841
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2841
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task
>         Environment: x86-64 Linux
>            Reporter: Binglin Chang
>            Assignee: Binglin Chang
>         Attachments: MAPREDUCE-2841.v1.patch, dualpivot-0.patch, dualpivotv20-0.patch
> I'm recently working on native optimization for MapTask based on JNI. 
> The basic idea is to add a NativeMapOutputCollector to handle k/v pairs emitted by the
> mapper, so that sort, spill, and IFile serialization can all be done in native code. A
> preliminary test (on Xeon E5410, jdk6u24) showed promising results:
> 1. Sort is about 3x-10x as fast as Java (only binary string comparison is supported)
> 2. IFile serialization speed is about 3x that of Java, about 500MB/s; if hardware CRC32C
> is used, it can get much faster (1GB/s).
> 3. Merge code is not completed yet, so the test uses enough io.sort.mb to prevent mid-spills
> This leads to a total speedup of 2x~3x for the whole MapTask if IdentityMapper (a mapper
> that does nothing) is used.
> There are limitations, of course: currently only Text and BytesWritable are supported,
> and I have not thought through many things yet, such as how to support map-side combine.
> I had some discussion with somebody familiar with Hive, and it seems these limitations
> won't be much of a problem for Hive to benefit from these optimizations, at least. Advice
> or discussion about improving compatibility is most welcome :)
> Currently NativeMapOutputCollector has a static method called canEnable(), which checks
> whether the key/value types, comparator type, and combiner are all compatible; MapTask can
> then choose to enable NativeMapOutputCollector.
> This is only a preliminary test; more work needs to be done. I expect better final results,
> and I believe similar optimizations can be applied to the reduce task and shuffle too.

This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

