hadoop-mapreduce-issues mailing list archives

From "Scott Chen (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-2212) MapTask and ReduceTask should only compress/decompress the final map output file
Date Mon, 13 Dec 2010 18:01:03 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970925#action_12970925 ]

Scott Chen commented on MAPREDUCE-2212:
---------------------------------------

I have done some experiments on the latency.
In the experiment, 500 MB of data is read from disk, compressed, and written back to disk.
It shows that the throughput with LZO is slightly worse than with no codec, but they are very close.

I think for latency there is not much difference.
The question here is about the trade-off between disk IO and CPU.
Using LZO costs more CPU (I don't have numbers for this) but cuts the disk IO roughly in half.
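
As a rough illustration of that trade-off, here is a back-of-envelope sketch (the measured numbers follow below). The ~0.50 ratio and the ~477 MB output size are taken from the LZO run; the number of spill/merge passes is a made-up parameter, not something I measured.
{code}
// Back-of-envelope sketch only: how much spill/merge disk traffic a ~0.50
// compression ratio saves.  mergePasses is a hypothetical value for illustration.
public class SpillIoEstimate {
  public static void main(String[] args) {
    double mapOutputMB = 476.8;  // uncompressed map output size, from the run below
    double lzoRatio = 0.50;      // measured LZO compression ratio, from the run below
    int mergePasses = 3;         // hypothetical number of times the data is rewritten
    double ioWithoutCodec = mapOutputMB * mergePasses * 2;          // write + read per pass
    double ioWithLzo = mapOutputMB * lzoRatio * mergePasses * 2;
    System.out.println("Disk IO without codec: " + ioWithoutCodec + " MB");
    System.out.println("Disk IO with LZO:      " + ioWithLzo + " MB");
  }
}
{code}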

{code}
================================================
Initialize codec lzo 
Finished. Time: 10278 ms
File size: 239.19908142089844MB Compression ratio: 0.501636832
Throughput: 47.50741875851333MB/s
================================================
Initialize codec gz
Finished. Time: 38132 ms
File size: 161.91629219055176MB Compression ratio: 0.339563076
Throughput: 12.805025962446239MB/s
================================================
Initialize codec none
Finished. Time: 8783 ms
File size: 476.837158203125MB Compression ratio: 1.0
Throughput: 55.59390299442104MB/s
================================================
{code}

Here is a simple example that produces these numbers.
{code}
// Compression.Algorithm below is the TFile/HFile compression helper; its import is
// omitted because the exact package depends on the Hadoop/HBase build being used.
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;

import junit.framework.TestCase;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.io.compress.CompressionCodec;

public class TestCodecDiskIO extends TestCase {

  Log LOG = LogFactory.getLog(TestCodecDiskIO.class);

  static {
    // Point the Compression helper at a concrete LZO codec implementation
    // (the hadoop-lzo codec is not bundled with Hadoop itself).
    System.setProperty(Compression.Algorithm.CONF_LZO_CLASS,
        "com.hadoop.compression.lzo.LzoCodec");
  }
  
  public void testCodecWrite()
      throws Exception {
    File dataFile = new File("/home/schen/data/test_data");
    print("Data file:" + dataFile.getName());
    InputStream in = new BufferedInputStream(new FileInputStream(dataFile));
    int dataLength = 500 * 1000 * 1000;  // 500 MB of test data (~476.8 MiB)
    byte buff[] = new byte[dataLength];
    print("Start reading file. Read length = " + dataLength);
    long start = now();
    // read() may return fewer bytes than requested, so loop until buff is full.
    int off = 0;
    while (off < dataLength) {
      int n = in.read(buff, off, dataLength - off);
      if (n < 0) {
        break;
      }
      off += n;
    }
    long timeSpent = now() - start;
    in.close();
    print("Reading time: " + timeSpent);
    
    byte buff2[] = new byte[dataLength];
    start = now();
    System.arraycopy(buff, 0, buff2, 0, buff.length);
    timeSpent = now() - start;
    print("Memory copy time: " + timeSpent);
    
    int count = 3;

    for (int i = 0; i < count; ++i) {
      for (Compression.Algorithm algo : Compression.Algorithm.values()) {
        print("================================================");
        print("Initialize codec " + algo.getName());
        CompressionCodec codec = algo.getCodec();
        File temp = File.createTempFile("test", "", new File("/tmp"));
        temp.deleteOnExit();
        FileOutputStream fout = new FileOutputStream(temp);
        BufferedOutputStream bout = new BufferedOutputStream(fout);
        OutputStream out;
        if (codec != null) {
          out = codec.createOutputStream(bout);
        } else {
          out = bout;
        }
        print("Start writing");
        start = now();
        out.write(buff);
        out.flush();
        fout.getFD().sync();
        out.close();
        timeSpent = now() - start;
        print("Finished. Time: " + timeSpent + " ms");
        print("File size: " + (temp.length() / 1024.0 / 1024.0) + "MB" +
            " Compression ratio: " + temp.length() / (double)(dataLength));
        print(("Throughput: " + dataLength / (double)(timeSpent) / 1024.0) + "MB/s");
      }
    }
    print("================================================");
  }

  private void print(String s) {
    System.out.println(s);
  }
  private long now() {
    return System.currentTimeMillis();
  }
}
{code}

> MapTask and ReduceTask should only compress/decompress the final map output file
> --------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-2212
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2212
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task
>    Affects Versions: 0.23.0
>            Reporter: Scott Chen
>            Assignee: Scott Chen
>             Fix For: 0.23.0
>
>
> Currently if we set mapred.map.output.compression.codec:
> 1. MapTask will compress every spill, decompress every spill, then merge and compress the final map output file.
> 2. ReduceTask will decompress, merge and compress every map output file, and repeat the compression/decompression on every pass.
> This causes all the data to be compressed and decompressed many times.
> The reason we need mapred.map.output.compression.codec is to reduce network traffic.
> We should not compress/decompress the data again and again during the merge sort.
> We should only compress the final map output file that will be transmitted over the network.
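
To make the proposed change concrete, here is a minimal sketch of the intended behavior (not the actual MapTask/ReduceTask code): spill files stay uncompressed, and only the final merged map output, the file that actually travels over the network, is wrapped with the configured codec. The class, method and file names below are placeholders.
{code}
// Minimal sketch of the proposed behavior, not the real MapTask/ReduceTask code path.
// Intermediate spills skip the codec; only the final merged map output is compressed
// with the codec configured via mapred.map.output.compression.codec.
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

import org.apache.hadoop.io.compress.CompressionCodec;

public class FinalOutputCompressionSketch {
  void writeOutputs(CompressionCodec codec, File spillFile, File finalFile,
                    byte[] records) throws IOException {
    // Spill files: plain buffered streams, no compress/decompress during the merge sort.
    OutputStream spillOut = new BufferedOutputStream(new FileOutputStream(spillFile));
    spillOut.write(records);
    spillOut.close();

    // Final map output: apply the codec only to the file shipped to the reducers.
    OutputStream finalOut = new BufferedOutputStream(new FileOutputStream(finalFile));
    if (codec != null) {
      finalOut = codec.createOutputStream(finalOut);
    }
    finalOut.write(records);  // stand-in for the real merge of all spill files
    finalOut.close();
  }
}
{code}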

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

