pig-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Rohini Palaniswamy" <rohini.adi...@gmail.com>
Subject Re: Review Request 15261: PIG-3555 Initial implementation of Tez combiner optimization
Date Wed, 06 Nov 2013 22:15:26 GMT

> On Nov. 6, 2013, 7:46 p.m., Rohini Palaniswamy wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java, line 156
> > <https://reviews.apache.org/r/15261/diff/1/?file=379002#file379002line156>
> >
> >     This + newEdges method does not seem to be right to me. Going with the MR combiner
model, I would expect the combinePlan of the inEdge (we need to tag the TezEdgeDescriptor
inEdges of the predecessor with the actual predecessor and the outedges with the actual successor)
actually connecting these vertexes to be set on this Edge for both in and out. The same combinePlan
should be applicable on both the outbound end (map/onfilesortedoutput) and inbound end (reduce/shufflemergeinput)
of the edge connecting the vertices.
> Cheolsoo Park wrote:
>     Hi Rohini, you're right that method is awfully wrong. Here is what I have in a new
>     # Convert List<TezEdgeDescriptor> to Map<OperatorKey, TezEdgeDescriptor>.
When TezCompiler adds a new operator, it will create a TezEdgeDescriptor on both ways (in
and out). Then, CombinerOptimizer will look up by OperatorKey and set combine plan.
>     # Each vertex has two maps of TezEdgeDescriptors for inbound and outbound edges.
>     >> The same combinePlan should be applicable on both the outbound end (map/onfilesortedoutput)
and inbound end (reduce/shufflemergeinput) of the edge connecting the vertices.
>     I can easily do this but am not sure whether this is right or wrong. I originally
imagine that inbound and outbound can have different combine plans. What do others think?


The map approach should work fine. I caught that easily as I was doing something similar for
split. I will just rebase my patch to use what you have.

AFAIK, There is only one combine plan for a MR Job in pig (pig.combinePlan). Haven't see a
different combine plan for map and a different one for reduce even though we can do it even
in MR framework. Not able to think of a use case where we would have different combine plans
on the two sides of the edge. May be others could come up with something. 

- Rohini

This is an automatically generated e-mail. To reply, visit:

On Nov. 6, 2013, 11:04 a.m., Cheolsoo Park wrote:
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/15261/
> -----------------------------------------------------------
> (Updated Nov. 6, 2013, 11:04 a.m.)
> Review request for pig, Alex Bain, Daniel Dai, Mark Wagner, and Rohini Palaniswamy.
> Bugs: PIG-3555
>     https://issues.apache.org/jira/browse/PIG-3555
> Repository: pig-git
> Description
> -------
> Initial implementation of Tez combiner optimizer. The patch includes the following changes-
> * Factored out CombinerOptimizer code into a utility class called CombinerOptimizerUtil.
So both MR and Tez CombinerOptimizer use this utility class instead of duplicating code.
> * Introduced a new class called TezEdgeDescriptor that holds combine plans as well as
various edge properties.
> * Added TezEdgeDescriptors to TezOperator. Note that I added multiple descriptors for
inbound edges but a single descriptor for all the outbound edges. This is because TezDagBuilder
always creates an edge by connecting predecessors to the current vertex. Please let me know
if you think we should allow multiple descriptors for outbound edges too.
> * Refactored some code in TezDagBuilder while touching it.
> Diffs
> -----
>   src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/CombinerOptimizer.java
>   src/org/apache/pig/backend/hadoop/executionengine/tez/CombinerOptimizer.java e69de29

>   src/org/apache/pig/backend/hadoop/executionengine/tez/TezCompiler.java 0b1f3c9 
>   src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java 45e47b0 
>   src/org/apache/pig/backend/hadoop/executionengine/tez/TezEdgeDescriptor.java e69de29

>   src/org/apache/pig/backend/hadoop/executionengine/tez/TezLauncher.java 3f14644 
>   src/org/apache/pig/backend/hadoop/executionengine/tez/TezOperator.java e612d88 
>   src/org/apache/pig/backend/hadoop/executionengine/tez/TezPrinter.java 5a42ded 
>   src/org/apache/pig/backend/hadoop/executionengine/util/CombinerOptimizerUtil.java e69de29

>   test/org/apache/pig/test/data/GoldenFiles/TEZC1.gld 925f07e 
>   test/org/apache/pig/test/data/GoldenFiles/TEZC2.gld a3974fe 
>   test/org/apache/pig/test/data/GoldenFiles/TEZC3.gld a8c942b 
>   test/org/apache/pig/test/data/GoldenFiles/TEZC4.gld fb7c903 
>   test/org/apache/pig/test/data/GoldenFiles/TEZC5.gld e6cd25e 
> Diff: https://reviews.apache.org/r/15261/diff/
> Testing
> -------
> ant test-tez passes.
> ant test-e2e-tez passes.
> I didn't add new test cases, but an e2e test case (Checkin_3) includes an algebraic udf
(count) following group-by. I also manually tested it on a live cluster.
> Thanks,
> Cheolsoo Park

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message