hadoop-mapreduce-issues mailing list archives

From "Robert Schmidtke (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (MAPREDUCE-6923) YARN Shuffle I/O for small partitions
Date Thu, 03 Aug 2017 12:29:00 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-6923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16112643#comment-16112643 ]

Robert Schmidtke edited comment on MAPREDUCE-6923 at 8/3/17 12:28 PM:
----------------------------------------------------------------------

FYI, I have benchmarked another version that uses casts instead of the ternary operator, using JMH on my Mac:

{code:java}
package de.schmidtke.java.benchmark;

import java.util.Random;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Level;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

public class TernaryBenchmark {

    @State(Scope.Thread)
    public static class TBState {
        private final Random random = new Random(0);
        public long trans;

        @Setup(Level.Invocation)
        public void setup() {
            trans = random.nextLong();
        }
    }

    @Benchmark
    public int testTernary(TBState tbState) {
        long trans = tbState.trans;
        return Math.min(131072,
                trans > Integer.MAX_VALUE ? Integer.MAX_VALUE : (int) trans);
    }

    @Benchmark
    public int testCast(TBState tbState) {
        long trans = tbState.trans;
        return (int) Math.min((long) 131072, trans);
    }

}
{code}
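
For completeness, a runner along the following lines can be used to execute all four modes (a minimal sketch assuming the standard jmh-core and jmh-generator-annprocess dependencies; the TernaryBenchmarkRunner class here is illustrative only, not part of the benchmark itself):

{code:java}
package de.schmidtke.java.benchmark;

import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

// Illustrative runner, shown only to make the setup easier to reproduce.
public class TernaryBenchmarkRunner {

    public static void main(String[] args) throws RunnerException {
        Options options = new OptionsBuilder()
                .include(TernaryBenchmark.class.getSimpleName()) // run the benchmark above
                .mode(Mode.All)                                  // thrpt, avgt, sample and ss modes
                .build();
        new Runner(options).run();
    }
}
{code}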

The results show roughly 1% higher throughput for the cast version; the rest seems about the same. I'd go with the ternary operator version for better clarity:

{code:none}
Benchmark                                           Mode      Cnt         Score        Error  Units
TernaryBenchmark.testCast                          thrpt      200  25142779.388 ± 114863.918  ops/s
TernaryBenchmark.testTernary                       thrpt      200  24829083.072 ±  64009.480  ops/s
TernaryBenchmark.testCast                           avgt      200        ≈ 10⁻⁷               s/op
TernaryBenchmark.testTernary                        avgt      200        ≈ 10⁻⁷               s/op
TernaryBenchmark.testCast                         sample  7596374        ≈ 10⁻⁷               s/op
TernaryBenchmark.testCast:testCast·p0.00          sample                 ≈ 10⁻⁹               s/op
TernaryBenchmark.testCast:testCast·p0.50          sample                 ≈ 10⁻⁷               s/op
TernaryBenchmark.testCast:testCast·p0.90          sample                 ≈ 10⁻⁷               s/op
TernaryBenchmark.testCast:testCast·p0.95          sample                 ≈ 10⁻⁷               s/op
TernaryBenchmark.testCast:testCast·p0.99          sample                 ≈ 10⁻⁷               s/op
TernaryBenchmark.testCast:testCast·p0.999         sample                 ≈ 10⁻⁷               s/op
TernaryBenchmark.testCast:testCast·p0.9999        sample                 ≈ 10⁻⁵               s/op
TernaryBenchmark.testCast:testCast·p1.00          sample                  0.002               s/op
TernaryBenchmark.testTernary                      sample  7469568        ≈ 10⁻⁷               s/op
TernaryBenchmark.testTernary:testTernary·p0.00    sample                 ≈ 10⁻⁹               s/op
TernaryBenchmark.testTernary:testTernary·p0.50    sample                 ≈ 10⁻⁷               s/op
TernaryBenchmark.testTernary:testTernary·p0.90    sample                 ≈ 10⁻⁷               s/op
TernaryBenchmark.testTernary:testTernary·p0.95    sample                 ≈ 10⁻⁷               s/op
TernaryBenchmark.testTernary:testTernary·p0.99    sample                 ≈ 10⁻⁷               s/op
TernaryBenchmark.testTernary:testTernary·p0.999   sample                 ≈ 10⁻⁷               s/op
TernaryBenchmark.testTernary:testTernary·p0.9999  sample                 ≈ 10⁻⁵               s/op
TernaryBenchmark.testTernary:testTernary·p1.00    sample                  0.002               s/op
TernaryBenchmark.testCast                             ss       10        ≈ 10⁻⁵               s/op
TernaryBenchmark.testTernary                          ss       10        ≈ 10⁻⁵               s/op
{code}
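
As a quick sanity check on choosing the ternary version purely for readability: for the non-negative byte counts that actually occur in FadvisedFileRegion, both variants clamp to the same value, so only the style differs (a small standalone snippet, separate from the benchmark above):

{code:java}
package de.schmidtke.java.benchmark;

// Standalone check (not benchmarked): both clamping variants agree for non-negative inputs.
public class ClampEquivalenceCheck {

    static int ternary(long trans) {
        return Math.min(131072, trans > Integer.MAX_VALUE ? Integer.MAX_VALUE : (int) trans);
    }

    static int cast(long trans) {
        return (int) Math.min((long) 131072, trans);
    }

    public static void main(String[] args) {
        long[] samples = { 0L, 1L, 65536L, 131071L, 131072L, 131073L, Integer.MAX_VALUE, Long.MAX_VALUE };
        for (long trans : samples) {
            if (ternary(trans) != cast(trans)) {
                throw new AssertionError("mismatch at trans=" + trans);
            }
            System.out.println(trans + " -> " + ternary(trans));
        }
    }
}
{code}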




> YARN Shuffle I/O for small partitions
> -------------------------------------
>
>                 Key: MAPREDUCE-6923
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6923
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>         Environment: Observed in Hadoop 2.7.3 and above (judging from the source code of later versions), and Ubuntu 16.04.
>            Reporter: Robert Schmidtke
>            Assignee: Robert Schmidtke
>         Attachments: MAPREDUCE-6923.00.patch
>
>
> When a job configuration results in small partitions read by each reducer from each mapper
> (e.g. 65 kilobytes as in my setup: a [TeraSort|https://github.com/apache/hadoop/blob/branch-2.7.3/hadoop-mapreduce-project/hadoop-mapreduce-examples/src/main/java/org/apache/hadoop/examples/terasort/TeraSort.java]
> of 256 gigabytes using 2048 mappers and reducers each), and setting
> {code:xml}
> <property>
>   <name>mapreduce.shuffle.transferTo.allowed</name>
>   <value>false</value>
> </property>
> {code}
> then the default setting of
> {code:xml}
> <property>
>   <name>mapreduce.shuffle.transfer.buffer.size</name>
>   <value>131072</value>
> </property>
> {code}
> results in almost 100% overhead in reads during shuffle in YARN, because for each 65K
> needed, 128K are read.
> I propose a fix in [FadvisedFileRegion.java|https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-shuffle/src/main/java/org/apache/hadoop/mapred/FadvisedFileRegion.java#L114]
> as follows:
> {code:java}
> ByteBuffer byteBuffer = ByteBuffer.allocate(Math.min(this.shuffleBufferSize,
>     trans > Integer.MAX_VALUE ? Integer.MAX_VALUE : (int) trans));
> {code}
> e.g. [here|https://github.com/apache/hadoop/compare/branch-2.7.3...robert-schmidtke:adaptive-shuffle-buffer].
> This sets the shuffle buffer size to the minimum of the buffer size specified in the
> configuration (128K by default) and the actual partition size (65K on average in my setup).
> In my benchmarks this reduced the read overhead in YARN from about 100% (255 additional
> gigabytes as described above) down to about 18% (an additional 45 gigabytes). The runtime
> of the job remained the same in my setup.
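
For reference, the overhead figures quoted in the description follow from simple partition arithmetic (a back-of-the-envelope sketch assuming 256 GiB of evenly partitioned TeraSort output; actual partition sizes will vary):

{code:java}
// Rough estimate only; assumes perfectly even partitioning of the 256 GiB TeraSort input.
public class ShuffleOverheadEstimate {

    public static void main(String[] args) {
        long totalBytes = 256L << 30;                            // 256 GiB of map output
        long mappers = 2048, reducers = 2048;
        long partitionBytes = totalBytes / (mappers * reducers); // 65,536 bytes per map/reduce pair
        long bufferBytes = 131072;                               // default shuffle transfer buffer size

        // Reading a full 128K buffer for every ~64K partition roughly doubles the bytes read.
        double overhead = (double) (bufferBytes - partitionBytes) / partitionBytes;
        System.out.printf("partition = %d bytes, read overhead = %.0f%%%n", partitionBytes, overhead * 100);
    }
}
{code}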



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

