pig-dev mailing list archives

From Jeff Zhang <zjf...@gmail.com>
Subject Optimization opportunity for group by followed by join on the same key ?
Date Thu, 05 Mar 2015 09:36:05 GMT
Hi folks,

Here's my pig script:

    a = load 'pig/input' as (x:int, y:chararray);
    b = load 'pig/input1' as (x:int, y:chararray);
    c = group a by x;
    d = foreach c generate group as x, COUNT($1) as cnt;
    d = join d by x, b by x;
    store d into 'pig/output';

I use Tez as the execution engine and noticed that Pig converts this script
into a single DAG with 4 vertices, as shown in the attached image. I think 3
vertices should be sufficient, because the group by and the join use the same
key. So vertex (scope_39) seems unnecessary: the grouped data is already
partitioned on x, so it does not need to be repartitioned for the join. The
only impact of going from 4 vertices to 3 may be on the parallelism of vertex
(scope_41). I'm not sure how large the performance difference between the two
plans would be, but I think this could be a potential optimization.
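
To make the reasoning concrete, here is a toy sketch in Python (not Pig or Tez
code, and not how Pig plans the job). It assumes both inputs are hash-partitioned
on x by the same function; the helper names partition, group_then_join and run
are made up for illustration. Under that assumption the count and the join can
run inside each partition with no second shuffle between them:

```python
from collections import Counter, defaultdict

def partition(rows, n):
    """Hash-partition (x, y) rows on key x into n buckets (hypothetical helper)."""
    parts = defaultdict(list)
    for x, y in rows:
        # Keys are ints here, so x % n stands in for a real hash partitioner.
        parts[x % n].append((x, y))
    return parts

def group_then_join(a_part, b_part):
    """Inside one partition: count a's rows per key, then join with b on that key."""
    counts = Counter(x for x, _ in a_part)
    # Output mirrors the Pig join schema: (d::x, d::cnt, b::x, b::y).
    return [(x, counts[x], x, y) for x, y in b_part if x in counts]

def run(a, b, n=3):
    """Partition both inputs once, then group and join per partition."""
    pa, pb = partition(a, n), partition(b, n)
    out = []
    for p in range(n):
        out.extend(group_then_join(pa[p], pb[p]))
    return sorted(out)

a = [(1, "p"), (1, "q"), (2, "r"), (4, "s")]
b = [(1, "u"), (2, "v"), (3, "w"), (4, "z")]
result = run(a, b)
```

The result does not depend on the number of partitions, which is the point: the
only thing the merged vertex would change is how many tasks do the work.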

[image: Inline image 1]

Best Regards

Jeff Zhang
