pig-dev mailing list archives

From "Pallavi Rao (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PIG-4709) Improve performance of GROUPBY operator on Spark
Date Fri, 27 Nov 2015 11:27:11 GMT

    [ https://issues.apache.org/jira/browse/PIG-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15029780#comment-15029780 ]

Pallavi Rao commented on PIG-4709:
----------------------------------

Before patch: 
{code}
2015-11-27 14:04:16,811 [main] INFO  org.apache.pig.tools.pigstats.spark.SparkPigStats - 
EexcutorDeserializeTime : 26668
2015-11-27 14:04:16,811 [main] INFO  org.apache.pig.tools.pigstats.spark.SparkPigStats - 
ExecutorRunTime : 110938
...
2015-11-27 14:04:16,812 [main] INFO  org.apache.pig.tools.pigstats.spark.SparkPigStats - 
ShuffleBytesWritten : 21465486
2015-11-27 14:04:16,812 [main] INFO  org.apache.pig.tools.pigstats.spark.SparkPigStats - 
ShuffleWriteTime : 470661000
{code}

After patch:
{code}
2015-11-27 13:58:52,205 [main] INFO  org.apache.pig.tools.pigstats.spark.SparkPigStats - 
EexcutorDeserializeTime : 20601
2015-11-27 13:58:52,205 [main] INFO  org.apache.pig.tools.pigstats.spark.SparkPigStats - 
ExecutorRunTime : 75101
2015-11-27 13:58:52,205 [main] INFO  org.apache.pig.tools.pigstats.spark.SparkPigStats - 
ResultSize : 12024
...
2015-11-27 13:58:52,205 [main] INFO  org.apache.pig.tools.pigstats.spark.SparkPigStats - 
ShuffleBytesWritten : 1046
2015-11-27 13:58:52,205 [main] INFO  org.apache.pig.tools.pigstats.spark.SparkPigStats - 
ShuffleWriteTime : 3486000
{code}
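
To make the two runs easier to compare, the counters above can be turned into before/after ratios. This is only a sketch: the numbers are copied from the two log excerpts, the log does not state units, so only ratios are computed, and the `ratio` helper is a hypothetical name used for illustration.

```python
# Counters copied from the "Before patch" and "After patch" log excerpts.
# Keys are spelled exactly as they appear in the log output.
before = {
    "EexcutorDeserializeTime": 26668,
    "ExecutorRunTime": 110938,
    "ShuffleBytesWritten": 21465486,
    "ShuffleWriteTime": 470661000,
}
after = {
    "EexcutorDeserializeTime": 20601,
    "ExecutorRunTime": 75101,
    "ShuffleBytesWritten": 1046,
    "ShuffleWriteTime": 3486000,
}


def ratio(old, new):
    """Hypothetical helper: how many times larger the old value is than the new one."""
    return old / new


# Only counters present in both excerpts are compared.
ratios = {name: ratio(before[name], after[name]) for name in before}

for name, value in ratios.items():
    print(f"{name}: {value:.2f}x lower after the patch")
```

The shuffle counters change by orders of magnitude (ShuffleBytesWritten drops by a factor of more than 20,000), while ExecutorRunTime drops by roughly a third.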


> Improve performance of GROUPBY operator on Spark
> ------------------------------------------------
>
>                 Key: PIG-4709
>                 URL: https://issues.apache.org/jira/browse/PIG-4709
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: Pallavi Rao
>            Assignee: Pallavi Rao
>              Labels: spork
>             Fix For: spark-branch
>
>         Attachments: PIG-4709.patch
>
>
> Currently, the GROUPBY operator of Pig is mapped to Spark's CoGroup. When the grouped data is consumed by subsequent algebraic operations, this is sub-optimal because it generates a lot of shuffle traffic.
> The Spark plan must be optimized to use reduceBy, where possible, so that a combiner is used.
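
The effect described in the issue can be illustrated with a small simulation. This is a plain-Python sketch, not Spark or Pig code and not the contents of PIG-4709.patch: the partitions, the function names, and the choice of SUM as the algebraic operation are all assumptions made for illustration. It counts how many records would cross the shuffle when every record is grouped first, versus when each partition pre-aggregates its own records (the combiner) before shuffling.

```python
from collections import defaultdict


def shuffle_without_combiner(partitions):
    """Group-then-aggregate: every (key, value) record crosses the shuffle."""
    shuffled = [record for part in partitions for record in part]
    groups = defaultdict(list)
    for key, value in shuffled:
        groups[key].append(value)
    totals = {key: sum(values) for key, values in groups.items()}
    return totals, len(shuffled)


def shuffle_with_combiner(partitions):
    """Combine-then-shuffle: each partition sends one partial sum per key."""
    shuffled = []
    for part in partitions:
        partial = defaultdict(int)
        for key, value in part:
            partial[key] += value  # map-side combine within the partition
        shuffled.extend(partial.items())
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value  # reduce-side merge of the partial sums
    return dict(totals), len(shuffled)


# Hypothetical input: two partitions of (key, value) records.
partitions = [
    [("a", 1), ("b", 2), ("a", 3), ("a", 4)],
    [("b", 5), ("b", 6), ("a", 7), ("c", 8)],
]

plain_result, plain_records = shuffle_without_combiner(partitions)
combined_result, combined_records = shuffle_with_combiner(partitions)

print("records shuffled without combiner:", plain_records)
print("records shuffled with combiner:   ", combined_records)
print("same totals:", plain_result == combined_result)
```

Both paths produce the same totals, but the combined path shuffles at most one record per key per partition, so the saving grows with the number of records per key. This only works when the aggregate can be computed from partial results, which is what "algebraic" refers to in the description.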



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
