pig-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Quang-Nhat HOANG-XUAN" <hxquangn...@gmail.com>
Subject Review Request 23804: PIG-4066 An optimization for ROLLUP operation on Pig
Date Tue, 22 Jul 2014 09:20:02 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/23804/
-----------------------------------------------------------

Review request for pig.


Bugs: PIG-4066
    https://issues.apache.org/jira/browse/PIG-4066


Repository: pig


Description
-------

This patch aims at addressing the current limitation of the ROLLUP operator in PIG: most of
the work is done in the Map phase of the underlying MapReduce job to generate all possible
intermediate keys that the reducer use to aggregate and produce the ROLLUP output. Based on
our previous work: “Duy-Hung Phan, Matteo Dell’Amico, Pietro Michiardi: On the design
space of MapReduce ROLLUP aggregates” (http://www.eurecom.fr/en/publication/4212/download/rs-publi-4212_2.pdf),
we show that the design space for a ROLLUP implementation allows for a different approach
(in-reducer grouping, IRG), in which less work is done in the Map phase and the grouping is
done in the Reduce phase. This patch presents the most efficient implementation we designed
(Hybrid IRG), which allows defining a parameter to balance between parallelism (in the reducers)
and communication cost.
This patch contains the following features:
1. The new ROLLUP approach: IRG, Hybrid IRG.
2. The PIVOT clause in CUBE operators.
3. Test cases.
The new syntax to use our ROLLUP approach:
alias = CUBE rel BY
{ CUBE col_ref | ROLLUP col_ref [PIVOT pivot_value]} [, { CUBE col_ref | ROLLUP col_ref [PIVOT
pivot_value]}
...]
In case there is multiple ROLLUP operator in one CUBE clause, the last ROLLUP operator will
be executed with our approach (IRG, Hybrid IRG) while the remaining ROLLUP ahead will be executed
with the default approach.
We have already made some experiments for comparison between our ROLLUP implementation and
the current ROLLUP. More information can be found at here: http://hxquangnhat.github.io/PIG-ROLLUP-H2IRG/


Diffs
-----

  trunk/src/org/apache/pig/backend/hadoop/executionengine/HExecutionEngine.java 1579421 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/CombinerOptimizer.java
1579421 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java
1579421 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java 1579421

  trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MultiQueryOptimizer.java
1579421 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PhyPlanSetter.java
1579421 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigGenericMapReduce.java
1579421 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/partitioners/RollupH2IRGPartitioner.java
PRE-CREATION 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java
1579421 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/DotPOPrinter.java
1579421 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/PhyPlanVisitor.java
1579421 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/PlanPrinter.java
1579421 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/PORollupH2IRGForEach.java
PRE-CREATION 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/util/PlanHelper.java
1579421 
  trunk/src/org/apache/pig/builtin/RollupDimensions.java 1579421 
  trunk/src/org/apache/pig/newplan/logical/DotLOPrinter.java 1579421 
  trunk/src/org/apache/pig/newplan/logical/expression/ExpToPhyTranslationVisitor.java 1579421

  trunk/src/org/apache/pig/newplan/logical/expression/UserFuncExpression.java 1579421 
  trunk/src/org/apache/pig/newplan/logical/optimizer/LogicalPlanOptimizer.java 1579421 
  trunk/src/org/apache/pig/newplan/logical/optimizer/LogicalPlanPrinter.java 1579421 
  trunk/src/org/apache/pig/newplan/logical/optimizer/SchemaResetter.java 1579421 
  trunk/src/org/apache/pig/newplan/logical/optimizer/UidResetter.java 1579421 
  trunk/src/org/apache/pig/newplan/logical/relational/LOCogroup.java 1579421 
  trunk/src/org/apache/pig/newplan/logical/relational/LOCube.java 1579421 
  trunk/src/org/apache/pig/newplan/logical/relational/LORollupH2IRGForEach.java PRE-CREATION

  trunk/src/org/apache/pig/newplan/logical/relational/LogToPhyTranslationVisitor.java 1579421

  trunk/src/org/apache/pig/newplan/logical/relational/LogicalRelationalNodesVisitor.java 1579421

  trunk/src/org/apache/pig/newplan/logical/rules/OptimizerUtils.java 1579421 
  trunk/src/org/apache/pig/newplan/logical/rules/RollupH2IRGOptimizer.java PRE-CREATION 
  trunk/src/org/apache/pig/parser/AliasMasker.g 1579421 
  trunk/src/org/apache/pig/parser/AstPrinter.g 1579421 
  trunk/src/org/apache/pig/parser/AstValidator.g 1579421 
  trunk/src/org/apache/pig/parser/LogicalPlanBuilder.java 1579421 
  trunk/src/org/apache/pig/parser/LogicalPlanGenerator.g 1579421 
  trunk/src/org/apache/pig/parser/QueryLexer.g 1579421 
  trunk/src/org/apache/pig/parser/QueryParser.g 1579421 
  trunk/test/org/apache/pig/test/TestCubeOperator.java 1579421 
  trunk/test/org/apache/pig/test/TestNewPlanLogToPhyTranslationVisitor.java 1579421 
  trunk/test/org/apache/pig/test/utils/GenPhyOp.java 1579421 

Diff: https://reviews.apache.org/r/23804/diff/


Testing
-------


Thanks,

Quang-Nhat HOANG-XUAN


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message