pig-dev mailing list archives

From "Dmitriy V. Ryaboy (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PIG-2397) Running TPC-H on Pig
Date Thu, 15 Dec 2011 23:15:30 GMT

    [ https://issues.apache.org/jira/browse/PIG-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13170566#comment-13170566 ]

Dmitriy V. Ryaboy commented on PIG-2397:
----------------------------------------

I was curious about the massive difference between what Jie was seeing for Hive and Pig on
Q1, and did a little digging of my own.
I couldn't reproduce a gap that large out of the box on my cluster: Hive ranged between 160
and 240 seconds, while Pig ranged between roughly 290 and 350 seconds over several runs of Q1.

Digging in a little further, I think there are 3 things worth noting:
1) The Hive TPC-H scripts set mapred.min.split.size=536870912 (512 MB) while the Pig ones do
not. This means Pig will pick up whatever the cluster defaults are, and the difference in the
number of mappers will be greatly exaggerated when running on small clusters incapable of
running hundreds of tasks in parallel (task set-up costs will keep accumulating). I recommend
the Pig scripts in PIG-2397 set this parameter to the same value as the Hive TPC-H scripts,
for consistency.

2) We generate a sampling job for an ORDER-BY even when the parallelism of that operator is
set to 1 (so sampling and custom partitioning are useless). Skipping it is a free performance
gain, and the case comes up in many real-life scripts, not just benchmarks. We should fix this
and get 30 seconds per job back.

3) When the split sizes are comparable for TPC-H Q1, Hive's tasks finish in about 60 seconds
on average, while Pig's take about 84 seconds. I believe this is because Hive triggers
in-memory aggregation and output based on memory utilization, while we have a hardcoded
MAX_SIZE_CURVAL_CACHE = 1024. In this particular case, that means Hive's tasks output 4 records
(a single aggregation), while we output 28 (9 aggregations). If we make MAX_SIZE_CURVAL_CACHE
configurable, or base it on memory, we can probably improve performance for small records.
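
To illustrate point 3, here is a toy model in Python of map-side partial aggregation with a
fixed cap on buffered values. This is NOT Pig's or Hive's actual code: the function name, the
flush policy, and the input data are all made up for illustration, and it does not try to
reproduce the 4 vs. 28 record counts above. It only shows why a small fixed cap makes a task
emit more partial records than a cap large enough to hold the whole input.

```python
from collections import defaultdict

def partial_aggregate(records, max_cached_values):
    """Toy map-side SUM (hypothetical, not Pig's implementation).

    Buffers values per key and collapses the buffer into partial sums
    each time it holds max_cached_values values in total.
    """
    emitted = []                  # partial (key, sum) records sent downstream
    buffered = defaultdict(list)  # key -> values seen since the last flush
    count = 0
    for key, value in records:
        buffered[key].append(value)
        count += 1
        if count >= max_cached_values:
            emitted.extend((k, sum(v)) for k, v in buffered.items())
            buffered.clear()
            count = 0
    # Final flush for whatever is still buffered at end of input.
    emitted.extend((k, sum(v)) for k, v in buffered.items())
    return emitted

# Made-up input: 10,000 small records spread evenly over 4 group keys.
records = [(i % 4, 1) for i in range(10000)]

small_cap = partial_aggregate(records, 1024)       # fixed cap, many flushes
large_cap = partial_aggregate(records, 1_000_000)  # cap never reached

print(len(small_cap), len(large_cap))
```

With the small cap the toy task flushes 10 times and emits 40 partial records; with the large
cap it flushes once and emits 4. A memory-based threshold behaves like the large cap whenever
the records are small.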

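For point 1, a rough back-of-the-envelope sketch in Python of how the minimum split size
changes the mapper count. It uses the simplified split-size rule from Hadoop's
FileInputFormat, max(minSize, min(maxSize, blockSize)), and ignores file boundaries, split
slop, and any split combining Pig may do. The 100 GB input and 64 MB block size are
assumptions for illustration, not numbers from my cluster.

```python
import math

MB = 1024 * 1024

def split_size(block_size, min_split_size=1, max_split_size=2**63 - 1):
    # Simplified FileInputFormat rule: max(minSize, min(maxSize, blockSize)).
    return max(min_split_size, min(max_split_size, block_size))

def estimated_mappers(input_bytes, block_size, min_split_size=1):
    # One mapper per split; treats the input as a single large file.
    return math.ceil(input_bytes / split_size(block_size, min_split_size))

input_bytes = 100 * 1024 * MB   # assumed: 100 GB of input
block_size = 64 * MB            # assumed: 64 MB HDFS blocks

default_mappers = estimated_mappers(input_bytes, block_size)
tuned_mappers = estimated_mappers(input_bytes, block_size, 536870912)

print(default_mappers, tuned_mappers)
```

Under these assumptions the default gives 1600 mappers and mapred.min.split.size=536870912
gives 200, so on a cluster with only a few dozen map slots the default run pays task set-up
costs eight times as often.
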
D
                
> Running TPC-H on Pig
> --------------------
>
>                 Key: PIG-2397
>                 URL: https://issues.apache.org/jira/browse/PIG-2397
>             Project: Pig
>          Issue Type: Task
>            Reporter: Jie Li
>         Attachments: TPC-H_on_Pig.tgz, pig_tpch.ppt
>
>
> For a class project we developed a whole set of Pig scripts for TPC-H. Our goals are:
> 1) identifying the bottlenecks of Pig's performance especially of its relational operators,
> 2) studying how to write efficient scripts by making full use of Pig Latin's features,
> 3) comparing with Hive's TPC-H results for verifying both 1) and 2).
> We will update the JIRA with our scripts, results and analysis soon.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
