pig-dev mailing list archives

From "Pallavi Rao (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (PIG-4709) Improve performance of GROUPBY operator on Spark
Date Tue, 08 Dec 2015 06:02:11 GMT

     [ https://issues.apache.org/jira/browse/PIG-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pallavi Rao updated PIG-4709:
-----------------------------
    Attachment: PIG-4709-v1.patch

Outlining the approach here:
Currently, the GROUPBY operator of Pig is mapped to Spark's CoGroup (via POGlobalRearrange).
When the grouped data is consumed by subsequent operations to perform algebraic operations,
this is sub-optimal, as there is a lot of shuffle traffic. It is recommended that Spark's
"reduceBy/combineBy/aggregateBy" operators be used in such cases. For this patch, Spark's
reduceBy operator was chosen over combineBy, primarily to enable splitting the plan into
multiple operators rather than stuffing all the operations (map, combine, merge) into a
single operator. Hence, this patch introduces a new physical operator in Spark called
POReduceBySpark. This class extends POForEach, mostly to enable consuming from input plans.
The Spark plan is processed by SparkCombinerOptimizer, which modifies the plan as follows:
{code}
 Checks for algebraic operations and, if they exist, replaces
    global rearrange (cogroup) with reduceBy as follows:
    Input:
    foreach (using algebraicOp)
       -> packager
          -> globalRearrange
              -> localRearrange
    Output:
     foreach (using algebraicOp.Final)
       -> reduceBy (uses algebraicOp.Intermediate)
          -> foreach (using algebraicOp.Initial)
              -> localRearrange
{code}
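For context, the "algebraicOp" above is a Pig UDF implementing the Algebraic interface, whose
methods name the EvalFunc classes used for the three phases the optimizer splits the computation
into (matching the Initial/Intermediate/Final labels in the diagram above). A rough sketch of
that contract, paraphrased from memory rather than quoted from the Pig source:
{code}
// Sketch of org.apache.pig.Algebraic: each method returns the class name of an
// EvalFunc that implements one phase of the aggregation.
public interface Algebraic {
    String getInitial();   // per-tuple step (for AVG: emit (num, 1))
    String getIntermed();  // merge step for partial results (for AVG: add sums and counts)
    String getFinal();     // final step (for AVG: divide sum by count)
}
{code}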
Example: if AVG(num) is the algebraic operation, the first foreach (Initial) will emit
(num, 1). Given 2 tuples (num1, 1) and (num2, 1), reduceBy (Intermediate) will emit
(num1+num2, 2), and so on. The final foreach (Final) will emit ((num1+num2+…)/count).
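To make the AVG example concrete, below is a minimal, self-contained sketch of the same
three-phase computation written directly against Spark's Java RDD API. This is not the patch's
POReduceBySpark code; it only illustrates the mapValues/reduceByKey pattern the rewritten plan
boils down to, and the class and variable names are made up for illustration.
{code}
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class AvgByKeySketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("avg-sketch").setMaster("local[*]"));

        // (key, num) pairs standing in for the relation being grouped.
        JavaPairRDD<String, Double> input = sc.parallelizePairs(Arrays.asList(
                new Tuple2<>("a", 10.0), new Tuple2<>("a", 20.0), new Tuple2<>("b", 5.0)));

        // "Initial" phase: each num becomes a partial (sum, count) = (num, 1).
        JavaPairRDD<String, Tuple2<Double, Long>> initial =
                input.mapValues(num -> new Tuple2<>(num, 1L));

        // "Intermediate" phase: reduceByKey merges partial (sum, count) pairs and
        // combines them on the map side before the shuffle.
        JavaPairRDD<String, Tuple2<Double, Long>> reduced =
                initial.reduceByKey((x, y) -> new Tuple2<>(x._1() + y._1(), x._2() + y._2()));

        // "Final" phase: sum / count per key.
        JavaPairRDD<String, Double> avg = reduced.mapValues(t -> t._1() / t._2());

        avg.collect().forEach(t -> System.out.println(t._1() + " -> " + t._2()));
        sc.stop();
    }
}
{code}
The map-side combining done by reduceByKey is what cuts the shuffle traffic that the
CoGroup-based plan incurs, which is the point of the optimization.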
The existing CombinerOptimizerUtil and the methods therein have been used heavily to perform
the optimization.

This optimization can be turned off by setting pig.exec.nocombiner to 'true'.

> Improve performance of GROUPBY operator on Spark
> ------------------------------------------------
>
>                 Key: PIG-4709
>                 URL: https://issues.apache.org/jira/browse/PIG-4709
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: Pallavi Rao
>            Assignee: Pallavi Rao
>              Labels: spork
>             Fix For: spark-branch
>
>         Attachments: PIG-4709-v1.patch, PIG-4709.patch
>
>
> Currently, the GROUPBY operator of Pig is mapped to Spark's CoGroup. When the grouped
> data is consumed by subsequent operations to perform algebraic operations, this is
> sub-optimal, as there is a lot of shuffle traffic.
> The Spark plan must be optimized to use reduceBy, where possible, so that a combiner
> is used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
