pig-dev mailing list archives

From "Cheolsoo Park" <piaozhe...@gmail.com>
Subject Re: Review Request 15261: PIG-3555 Initial implementation of Tez combiner optimization
Date Wed, 06 Nov 2013 23:55:45 GMT

(Updated Nov. 6, 2013, 11:55 p.m.)

Review request for pig, Alex Bain, Daniel Dai, Mark Wagner, and Rohini Palaniswamy.


Uploaded a new patch that includes the following changes:
* Adds two Map<OperatorKey, TezEdgeDescriptor> maps to TezOperator.
* Adds combine plans to outbound edges (map/onfilesortedoutput) instead of inbound edges
(reduce/shufflemergeinput). This matches the behavior of MR-Pig.
* Adds a few Pig-specific properties to the edge payload to make PigCombiner work.
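
To make the shape of the changes listed above concrete, here is a minimal sketch, not the code in the patch. All classes are simplified stand-ins, and the field names inEdges, outEdges, combinePlan, and properties are assumptions chosen for illustration. It only shows one descriptor map per direction, with the combine plan attached to the outbound side.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins; the real OperatorKey, TezEdgeDescriptor, and
// TezOperator in Pig carry more state than shown here.
final class OperatorKey {
    final String scope;
    final long id;
    OperatorKey(String scope, long id) { this.scope = scope; this.id = id; }
    @Override public boolean equals(Object o) {
        return o instanceof OperatorKey
            && ((OperatorKey) o).scope.equals(scope)
            && ((OperatorKey) o).id == id;
    }
    @Override public int hashCode() { return scope.hashCode() * 31 + Long.hashCode(id); }
}

final class TezEdgeDescriptor {
    // Placeholder for a combine plan; a String keeps the sketch self-contained.
    String combinePlan; // null means "no combiner on this edge"
    // Placeholder for edge properties such as Pig-specific payload entries.
    final Map<String, String> properties = new HashMap<>();
}

final class TezOperator {
    final OperatorKey key;
    // Assumed field names: one map per direction, keyed by the neighbor's OperatorKey.
    final Map<OperatorKey, TezEdgeDescriptor> inEdges = new HashMap<>();
    final Map<OperatorKey, TezEdgeDescriptor> outEdges = new HashMap<>();
    TezOperator(OperatorKey key) { this.key = key; }
}

final class EdgeDescriptorSketch {
    /** Connects from -> to, attaching the combine plan to the outbound side only. */
    static void connect(TezOperator from, TezOperator to, String combinePlan) {
        TezEdgeDescriptor out = new TezEdgeDescriptor();
        out.combinePlan = combinePlan; // the combiner runs where the output is written
        from.outEdges.put(to.key, out);
        to.inEdges.put(from.key, new TezEdgeDescriptor()); // inbound side has no combine plan
    }

    public static void main(String[] args) {
        TezOperator map = new TezOperator(new OperatorKey("scope", 1));
        TezOperator reduce = new TezOperator(new OperatorKey("scope", 2));
        connect(map, reduce, "combine-plan-placeholder");
        System.out.println(map.outEdges.get(reduce.key).combinePlan);  // combine-plan-placeholder
        System.out.println(reduce.inEdges.get(map.key).combinePlan);   // null
    }
}
```

Keying both maps by the neighbor's OperatorKey lets either end of an edge look up its descriptor without scanning a list.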

I still have to go through Mark's comments, but with this patch, combiners seem to work now.
I can see counters in the task logs as follows:

Combine input records=3, Combine output records=8

Bugs: PIG-3555

Repository: pig-git


Initial implementation of the Tez combiner optimizer. The patch includes the following changes:
* Factored out the CombinerOptimizer code into a utility class called CombinerOptimizerUtil, so
that both the MR and Tez CombinerOptimizer use this utility class instead of duplicating code.
* Introduced a new class called TezEdgeDescriptor that holds combine plans as well as various
edge properties.
* Added TezEdgeDescriptors to TezOperator. Note that I added multiple descriptors for inbound
edges but a single descriptor for all outbound edges, because TezDagBuilder always creates
an edge by connecting predecessors to the current vertex. Please let me know if you think we
should allow multiple descriptors for outbound edges too.
* Refactored some code in TezDagBuilder while touching it.
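
As a rough illustration of the point about edges being created from predecessors, the following sketch walks a plan and emits one edge per (predecessor, current vertex) pair. It is not TezDagBuilder: the class name DagWiringSketch, the buildEdges method, and the string-based plan are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of building edges by walking predecessors; vertex names
// are plain strings and the plan is a predecessor map, not Pig's real plan classes.
final class DagWiringSketch {
    /** Returns edges as "source->target", one per (predecessor, current vertex) pair. */
    static List<String> buildEdges(Map<String, List<String>> predecessors) {
        List<String> edges = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : predecessors.entrySet()) {
            String current = entry.getKey();
            for (String pred : entry.getValue()) {
                // The edge is created while visiting the target vertex, so a
                // per-predecessor (inbound) descriptor is the natural lookup here.
                edges.add(pred + "->" + current);
            }
        }
        return edges;
    }

    public static void main(String[] args) {
        Map<String, List<String>> predecessors = new LinkedHashMap<>();
        predecessors.put("load", Collections.<String>emptyList());
        predecessors.put("group", Arrays.asList("load"));
        predecessors.put("store", Arrays.asList("group"));
        System.out.println(buildEdges(predecessors)); // [load->group, group->store]
    }
}
```

Because each edge is built from the target's point of view, the builder in this sketch never needs to ask a vertex which of its outbound edges it is currently wiring.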

Diffs (updated)

  src/org/apache/pig/backend/hadoop/executionengine/tez/CombinerOptimizer.java e69de29 
  src/org/apache/pig/backend/hadoop/executionengine/tez/TezCompiler.java 0b1f3c9 
  src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java 45e47b0 
  src/org/apache/pig/backend/hadoop/executionengine/tez/TezEdgeDescriptor.java e69de29 
  src/org/apache/pig/backend/hadoop/executionengine/tez/TezLauncher.java 3f14644 
  src/org/apache/pig/backend/hadoop/executionengine/tez/TezOperator.java e612d88 
  src/org/apache/pig/backend/hadoop/executionengine/tez/TezPrinter.java 5a42ded 
  src/org/apache/pig/backend/hadoop/executionengine/util/CombinerOptimizerUtil.java e69de29

  test/org/apache/pig/test/data/GoldenFiles/TEZC1.gld 925f07e 
  test/org/apache/pig/test/data/GoldenFiles/TEZC2.gld a3974fe 
  test/org/apache/pig/test/data/GoldenFiles/TEZC3.gld a8c942b 
  test/org/apache/pig/test/data/GoldenFiles/TEZC4.gld fb7c903 
  test/org/apache/pig/test/data/GoldenFiles/TEZC5.gld e6cd25e 

Diff: https://reviews.apache.org/r/15261/diff/


ant test-tez passes.
ant test-e2e-tez passes.

I didn't add new test cases, but an e2e test case (Checkin_3) includes an algebraic UDF (COUNT)
following a group-by. I also manually tested it on a live cluster.
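
For readers unfamiliar with why an algebraic function such as COUNT exercises the combiner, here is a small sketch of the initial / intermediate / final split. It is not Pig's COUNT implementation; the class and method names are hypothetical, and the "chunks" stand in for the outputs of separate map-side tasks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not Pig's COUNT: it only illustrates the
// initial / intermediate / final split that lets a combiner pre-aggregate.
final class AlgebraicCountSketch {
    /** Initial step: each input record contributes a partial count of 1. */
    static long initial(Object record) { return 1L; }

    /** Intermediate step (combiner): sums partial counts into one partial count. */
    static long intermediate(List<Long> partials) {
        long sum = 0;
        for (long p : partials) sum += p;
        return sum;
    }

    /** Final step (reducer): sums whatever partial counts arrive. */
    static long finalStep(List<Long> partials) { return intermediate(partials); }

    /** Counts records split across chunks, optionally combining each chunk first. */
    static long count(List<List<String>> chunks, boolean useCombiner) {
        List<Long> shuffled = new ArrayList<>();
        for (List<String> chunk : chunks) {
            List<Long> partials = new ArrayList<>();
            for (String record : chunk) partials.add(initial(record));
            if (useCombiner) {
                shuffled.add(intermediate(partials)); // one value per chunk crosses the edge
            } else {
                shuffled.addAll(partials);            // every record crosses the edge
            }
        }
        return finalStep(shuffled);
    }

    public static void main(String[] args) {
        List<List<String>> chunks = Arrays.asList(
            Arrays.asList("a", "b", "c"),
            Arrays.asList("d", "e"));
        System.out.println(count(chunks, false)); // 5
        System.out.println(count(chunks, true));  // 5
    }
}
```

The result is the same with or without the combiner; only the number of values crossing the edge changes.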


Cheolsoo Park
