pig-dev mailing list archives

From "Pallavi Rao" <pallavi....@inmobi.com>
Subject Re: Review Request 40743: PIG-4709 Improve performance of GROUPBY operator on Spark
Date Wed, 09 Dec 2015 05:49:58 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/40743/
-----------------------------------------------------------

(Updated Dec. 9, 2015, 5:49 a.m.)


Review request for pig, Mohit Sabharwal and Xuefu Zhang.


Changes
-------

Removed LogicalPlanGenerator.g changes.


Bugs: PIG-4709
    https://issues.apache.org/jira/browse/PIG-4709


Repository: pig-git


Description
-------

Currently, Pig's GROUPBY operator is mapped to Spark's CoGroup. When the grouped data
is consumed by subsequent algebraic operations, this is sub-optimal because it generates
a lot of shuffle traffic.
The Spark plan should be optimized to use reduceBy where possible, so that a combiner is used.

Introduced a combiner optimizer that does the following:
    // Checks for algebraic operations and, if they exist,
    // replaces global rearrange (cogroup) with reduceBy as follows:
    // Input:
    // foreach (using algebraicOp)
    //   -> packager
    //      -> globalRearrange
    //          -> localRearrange
    // Output:
    // foreach (using algebraicOp.Final)
    //   -> reduceBy (uses algebraicOp.Intermediate)
    //      -> foreach (using algebraicOp.Initial)
    //          -> localRearrange
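
The rewrite above relies on an algebraic function being split into Initial, Intermediate and Final stages. The following is a minimal sketch of that split for an AVG-style aggregate, written in plain Java with no Pig or Spark dependencies. The class and method names (CombinerSketch, SumCount, averageByKey, kv) are hypothetical and are not taken from the patch; the nested lists stand in for Spark partitions and the final merge stands in for the shuffle.

```java
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: not the code in the patch.
public class CombinerSketch {
    // Partial aggregate for AVG: a running sum and a running count.
    static final class SumCount {
        final long sum;
        final long count;
        SumCount(long sum, long count) { this.sum = sum; this.count = count; }
    }

    // Initial stage: turn one raw value into a partial aggregate.
    static SumCount initial(long value) { return new SumCount(value, 1); }

    // Intermediate stage: merge two partials. It is associative and commutative,
    // so it can run both before and after the shuffle.
    static SumCount intermediate(SumCount a, SumCount b) {
        return new SumCount(a.sum + b.sum, a.count + b.count);
    }

    // Final stage: turn the fully merged partial into the result.
    static double finalStage(SumCount p) { return (double) p.sum / p.count; }

    // Helper to build a (key, value) record.
    static Map.Entry<String, Long> kv(String key, long value) {
        return new AbstractMap.SimpleEntry<>(key, value);
    }

    // Map-side combine: collapse one partition to a single partial per key.
    static Map<String, SumCount> combinePartition(List<Map.Entry<String, Long>> partition) {
        Map<String, SumCount> partials = new LinkedHashMap<>();
        for (Map.Entry<String, Long> record : partition) {
            partials.merge(record.getKey(), initial(record.getValue()), CombinerSketch::intermediate);
        }
        return partials;
    }

    static Map<String, Double> averageByKey(List<List<Map.Entry<String, Long>>> partitions) {
        Map<String, SumCount> merged = new LinkedHashMap<>();
        for (List<Map.Entry<String, Long>> partition : partitions) {
            // Only one partial per key and partition crosses the simulated shuffle.
            combinePartition(partition)
                .forEach((key, partial) -> merged.merge(key, partial, CombinerSketch::intermediate));
        }
        Map<String, Double> result = new LinkedHashMap<>();
        merged.forEach((key, partial) -> result.put(key, finalStage(partial)));
        return result;
    }

    public static void main(String[] args) {
        List<List<Map.Entry<String, Long>>> partitions = Arrays.asList(
            Arrays.asList(kv("a", 1), kv("a", 3), kv("b", 10)),
            Arrays.asList(kv("a", 5), kv("b", 20)));
        System.out.println(averageByKey(partitions)); // prints {a=3.0, b=15.0}
    }
}
```

In this sketch the first partition sends two partials (one for "a", one for "b") instead of three raw records, which is the kind of saving the combiner is meant to provide.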


Diffs (updated)
-----

  src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POForEach.java f8c1658
  src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/PORollupHIIForEach.java aca347d
  src/org/apache/pig/backend/hadoop/executionengine/spark/SparkLauncher.java a4dbadd
  src/org/apache/pig/backend/hadoop/executionengine/spark/converter/GlobalRearrangeConverter.java 5f74992
  src/org/apache/pig/backend/hadoop/executionengine/spark/converter/LocalRearrangeConverter.java 9ce0492
  src/org/apache/pig/backend/hadoop/executionengine/spark/converter/PigSecondaryKeyComparatorSpark.java PRE-CREATION
  src/org/apache/pig/backend/hadoop/executionengine/spark/converter/PreCombinerLocalRearrangeConverter.java PRE-CREATION
  src/org/apache/pig/backend/hadoop/executionengine/spark/converter/ReduceByConverter.java PRE-CREATION
  src/org/apache/pig/backend/hadoop/executionengine/spark/operator/POReduceBySpark.java PRE-CREATION
  src/org/apache/pig/backend/hadoop/executionengine/spark/optimizer/SparkCombinerOptimizer.java PRE-CREATION
  src/org/apache/pig/backend/hadoop/executionengine/util/CombinerOptimizerUtil.java 6b66ca1
  src/org/apache/pig/backend/hadoop/executionengine/util/SecondaryKeyOptimizerUtil.java 546d91e
  test/org/apache/pig/test/TestCombiner.java df44293

Diff: https://reviews.apache.org/r/40743/diff/


Testing
-------

The patch unblocked one unit test in TestCombiner. Added another unit test in the same class.
Also did some manual testing.


Thanks,

Pallavi Rao

