Mailing-List: contact mapreduce-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: mapreduce-issues@hadoop.apache.org
Date: Thu, 20 Oct 2011 19:30:10 +0000 (UTC)
From: "Todd Lipcon (Commented) (JIRA)" <jira@apache.org>
To: mapreduce-issues@hadoop.apache.org
Message-ID: 
 <794720935.17021.1319139010844.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: 
 <1613584465.16345.1319131934665.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (MAPREDUCE-3235) Improve CPU cache behavior in
 map side sort
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/MAPREDUCE-3235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13131934#comment-13131934 ] 

Todd Lipcon commented on MAPREDUCE-3235:
----------------------------------------

bq. -1 for the second. The key distribution of terasort, which results in compare != 0 most of the time, is an anomaly. In my experience, where skews etc are almost always a fact of life, more compares return 0 than non-zero.

Definitely worth considering. Like Chris said, this comparison is practically free - we don't have to delegate to any "proxy objects" as you put it. The proxies have to be 4-byte strings. Since we already do a compare on the PARTITION part of the metadata, making that comparison an 8-byte compare instead of a 4-byte compare doesn't really cost anything. So, the only real cost is the extra accounting overhead in the buffer - 20 bytes per record instead of 16. One optimization there would be to pack the PARTITION field more tightly - in most MR jobs, we have <256 reducers, so partition could be a single byte. Since we know the number of reducers up front, we could easily trade-off space between the partition ID and the comparison proxy.

bq. What's the API look like?

Currently I added a marker interface with a single method: getPrefix(byte[] dst, int off, int length). If the key type implements this interface, getPrefix() is called by collect() to copy the comparison proxy into the kvmeta buffer. I was thinking last night that it would be better to delegate to the Serializer implementation there, though. I just did the above for expediency last night while hacking this together. Alternatively, it might make sense to add another interface like how RawComparator is done.

bq. Both of these would be pretty esoteric config knobs

I imagine most of the "stock writables" we have in Hadoop could easily implement this - eg Text, BytesWritable, LongWritable, etc. Frameworks like Pig/Hive could get it in as well. Application programmers implementing their own key types already tend to implement RawComparators, so allowing them to implement another simple method for a good CPU boost doesn't seem too bad.
                
> Improve CPU cache behavior in map side sort
> -------------------------------------------
>
>                 Key: MAPREDUCE-3235
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3235
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task
>    Affects Versions: 0.23.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>
> When running oprofile on a terasort workload, I noticed that a large amount of CPU usage was going to MapTask$MapOutputBuffer.compare. Upon disassembling this and looking at cycle counters, most of the cycles were going to memory loads dereferencing into the array of key-value data -- implying expensive cache misses. This can be avoided as follows:
> - rather than simply swapping indexes into the kv array, swap the entire meta entries in the meta array. Swapping 16 bytes is only negligibly slower than swapping 4 bytes. This requires adding the value-length into the meta array, since we used to rely on the previous-in-the-array meta entry to determine this. So we replace INDEX with VALUELEN and avoid one layer of indirection.
> - introduce an interface which allows key types to provide a 4-byte comparison proxy. For string keys, this can simply be the first 4 bytes of the string. The idea is that, if stringCompare(key1.proxy(), key2.proxy()) != 0, then compare(key1, key2) should have the same result. If the proxies are equal, the normal comparison method is used. We then include the 4-byte proxy as part of the metadata entry, so that for many cases the indirection into the data buffer can be avoided.
> On a terasort benchmark, these optimizations plus an optimization to WritableComparator.compareBytes dropped the aggregate mapside CPU millis by 40%, and the compare() routine mostly dropped off the oprofile results.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira