hadoop-mapreduce-issues mailing list archives

From "Chris Douglas (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-2841) Task level native optimization
Date Mon, 29 Aug 2011 02:19:38 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13092596#comment-13092596 ]

Chris Douglas commented on MAPREDUCE-2841:

{quote}Just an idea: what if memory-related configurations could be a random variable,
with mean & variance? Could this lead to better resource utilization? A fixed memory bound
always means applications will request more memory than they really need.
I think in many cases predictable memory control is enough, rather than precise memory control,
since the latter is impractical. We can use some dynamic memory if it is in a predictable range,
for example +/-20%, +/-30%, etc.{quote}

The fixed memory bound definitely causes resource waste. Not only will users ask for more
memory than they need (particularly since most applications are not tightly tuned), but in
our clusters, users will just as often request far too little. Because tasks' memory management
is uniformly specified within a job, there isn't even an opportunity for the framework to
adapt to skew.

The random memory config is an interesting idea, but failed tasks are a regrettable and expensive
waste. For pipelines with SLAs, "random" failures will probably motivate users to jack up
their memory requirements to match the range (which, if configurable, seems to encode the
same contract). The precise specification was meant to avoid OOMs; because the collection is across
a JNI boundary, a "relaxed" predictable memory footprint could be easier to deploy, assuming
a hard limit in the native code to avoid swapping.
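
To make the "relaxed footprint with a hard limit" idea concrete, here is a minimal sketch in Java. Everything in it is hypothetical: the class name RelaxedMemoryBudget, the percentage tolerance, and the idea that crossing the target is a hint to spill are assumptions for illustration, not anything in the patch.

```java
/**
 * Hypothetical sketch: a budget with a soft target and a hard limit.
 * Usage may drift above the target by tolerancePercent, but a
 * reservation that would cross the hard limit is always refused.
 */
final class RelaxedMemoryBudget {
    private final long targetBytes;
    private final long hardLimitBytes;
    private long usedBytes;

    RelaxedMemoryBudget(long targetBytes, int tolerancePercent) {
        this.targetBytes = targetBytes;
        // Integer arithmetic keeps the limit exact, e.g. 100 bytes + 20% = 120 bytes.
        this.hardLimitBytes = targetBytes + targetBytes * tolerancePercent / 100;
    }

    /** Reserves bytes if that stays within the hard limit; otherwise changes nothing. */
    boolean tryReserve(long bytes) {
        if (bytes < 0 || usedBytes + bytes > hardLimitBytes) {
            return false;
        }
        usedBytes += bytes;
        return true;
    }

    /** True once usage is above the soft target (assumed here to be a hint to spill). */
    boolean overTarget() {
        return usedBytes > targetBytes;
    }

    /** Returns bytes to the budget, never dropping below zero. */
    void release(long bytes) {
        usedBytes = Math.max(0, usedBytes - bytes);
    }
}
```

The caller sees a predictable range (target to hard limit) rather than a precise bound, and the refusal at the hard limit is what would keep the native side from swapping.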

Thanks for the detail on the collection data structures. That makes it much easier to orient
oneself in the code.

A few quick notes on your [earlier|https://issues.apache.org/jira/browse/MAPREDUCE-2841?focusedCommentId=13086973&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13086973] comment:

Adding the partition to the record, again, was to make the memory more predictable. The overhead
in Java of tracking thousands of per-partition buckets (many going unused) was worse than the
per-record overhead, particularly in large jobs. Further, user comparators are often horribly
inefficient, so the partition comparison and the related performance hit were in the noise.
The cache miss is real, but hard to reason about without leaving the JVM.
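
As an illustration of keeping the partition in the record instead of in per-partition buckets, the ordering can be sketched as below. This is a simplified sketch, not the existing collector's code: PartitionedRecord and its fields are hypothetical names, and the key comparison is a plain unsigned byte-wise (memcmp-style) compare.

```java
import java.util.Comparator;

/** Hypothetical record: the partition travels with the serialized key bytes. */
final class PartitionedRecord {
    final int partition;
    final byte[] key;

    PartitionedRecord(int partition, byte[] key) {
        this.partition = partition;
        this.key = key;
    }

    /** Orders by partition first, then by unsigned lexicographic order of the key bytes. */
    static final Comparator<PartitionedRecord> ORDER = (a, b) -> {
        if (a.partition != b.partition) {
            return Integer.compare(a.partition, b.partition);
        }
        return compareBytes(a.key, b.key);
    };

    /** memcmp-style comparison; on a shared prefix the shorter array sorts first. */
    static int compareBytes(byte[] x, byte[] y) {
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int d = (x[i] & 0xff) - (y[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return x.length - y.length;
    }
}
```

One sort over such records groups each partition into a contiguous run, at the cost of one extra integer comparison per compare.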

The decorator-based stream is (or was) required by the serialization interface. While the current
patch only supports records with a known serialized length, the contract for other types is
more general. Probably too general, but users with occasional several-hundred MB records (written
in chunks) exist. Supporting that in this implementation is not a critical use case, since
they can just use the existing collector. Tuning this to handle memcmp types could also put
the burden of user comparators on the serialization frameworks, which is probably the best
strategy. Which is to say: obsoleting the existing collection framework doesn't require that
this support all of its use cases, if some of those can be worked around more competently
elsewhere. If its principal focus is performance, it may make sense not to support inherently
slow semantics.

Which brings up a point: what is the scope of this JIRA? A full, native task runtime is a
formidable job. Even if it only supported memcmp key types, no map-side combiner, no user-defined
comparators, and records smaller than its intermediate buffer, such an improvement would still
cover a lot of user jobs. It might make sense to commit that subset as optional functionality
first, then iterate based on feedback.
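
A rough sketch of what gating on that subset might look like follows. The class and method names are hypothetical, the inputs are passed as plain values so the example has no Hadoop dependency, and treating Text and BytesWritable as the memcmp-comparable key types is an assumption taken from the issue description's current limitations.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical eligibility check for an optional native collector. */
final class NativeCollectorEligibility {
    // Assumption: the key types the description lists as currently supported.
    private static final Set<String> MEMCMP_KEY_TYPES = new HashSet<>(Arrays.asList(
            "org.apache.hadoop.io.Text",
            "org.apache.hadoop.io.BytesWritable"));

    /** Job-level check: memcmp key type, no map-side combiner, no user-defined comparator. */
    static boolean jobEligible(String keyClassName, boolean hasCombiner, boolean hasUserComparator) {
        return MEMCMP_KEY_TYPES.contains(keyClassName) && !hasCombiner && !hasUserComparator;
    }

    /** Record-level check: the record must be smaller than the intermediate buffer. */
    static boolean recordFits(long recordBytes, long bufferBytes) {
        return recordBytes < bufferBytes;
    }
}
```

Jobs that fail the job-level check would simply keep using the existing collector, which is what makes committing the subset as optional functionality low-risk.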

> Task level native optimization
> ------------------------------
>                 Key: MAPREDUCE-2841
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2841
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task
>         Environment: x86-64 Linux
>            Reporter: Binglin Chang
>            Assignee: Binglin Chang
>         Attachments: MAPREDUCE-2841.v1.patch, dualpivot-0.patch, dualpivotv20-0.patch
> I've recently been working on a native optimization for MapTask based on JNI.
> The basic idea is to add a NativeMapOutputCollector to handle k/v pairs emitted by the
> mapper, so that sort, spill, and IFile serialization can all be done in native code. A preliminary
> test (on Xeon E5410, jdk6u24) showed promising results:
> 1. Sort is about 3x-10x as fast as Java (only binary string compare is supported)
> 2. IFile serialization speed is about 3x that of Java, about 500MB/s; if hardware CRC32C is
> used, things can get much faster (1G/s).
> 3. Merge code is not completed yet, so the test uses enough io.sort.mb to prevent mid-spill
> This leads to a total speedup of 2x~3x for the whole MapTask, if IdentityMapper (a mapper
> that does nothing) is used.
> There are limitations, of course: currently only Text and BytesWritable are supported,
> and I have not thought through many things yet, such as how to support map-side combine.
> I had some discussion with somebody familiar with Hive, and it seems that these limitations won't
> be much of a problem for Hive to benefit from those optimizations, at least. Advice or discussion
> about improving compatibility is most welcome :)
> Currently NativeMapOutputCollector has a static method called canEnable(), which checks
> whether the key/value type, comparator type, and combiner are all compatible; if so, MapTask can
> choose to enable NativeMapOutputCollector.
> This is only a preliminary test, and more work needs to be done. I expect better final results,
> and I believe similar optimizations can be adopted for the reduce task and shuffle too.
