pig-dev mailing list archives

From "Thejas M Nair (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PIG-1846) optimize queries like - count distinct users for each gender
Date Fri, 10 Jun 2011 21:31:59 GMT

    [ https://issues.apache.org/jira/browse/PIG-1846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13047476#comment-13047476 ]

Thejas M Nair commented on PIG-1846:

bq. yeah I was just using short-hand with the distinct thing, and assumed you would know what I meant
I didn't notice the mistake when I wrote the example. But the shorthand is more readable, so I have created PIG-2117 to discuss supporting that syntax.

bq.  Regarding two distincts – we can run the initial group-bys twice, and join?
Yes, that will work. 

If the UDF FUNC is algebraic and FUNC.Initial() returns something that is smaller than its
argument (e.g., COUNT), a further optimization would be:

in = FOREACH in GENERATE *, ALGFUNC$Initial(c4) as init;
gby_dist = GROUP in BY (c1, c2, c3) PARALLEL 100;
res_dist = FOREACH gby_dist GENERATE 
  group.c1, group.c2, FUNC.Initial(c3),
  ALGFUNC$Intermed(in.init) as intermed;
gby = GROUP res_dist BY (c1, c2) PARALLEL 100;
  FLATTEN(group) as (c1, c2),
Here FUNC2 is like the ALGFUNC2 described earlier, with FUNC2.Initial the same as FUNC.Intermed.
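
To illustrate why the two-level grouping gives the same result, here is a minimal sketch for the count-distinct case. It is written in Python rather than Pig Latin so that it runs standalone. The function names and sample rows are hypothetical, and the code only mimics the shape of the plan (group on the wider key first, then on the original key); it is not how Pig implements it.

```python
from collections import defaultdict


def count_distinct_single_stage(rows):
    """One group per gender: each group must hold every user for that gender."""
    users_by_gender = defaultdict(set)
    for user, gender in rows:
        users_by_gender[gender].add(user)
    return {gender: len(users) for gender, users in users_by_gender.items()}


def count_distinct_two_stage(rows):
    """Group by (gender, user) first, then by gender alone."""
    # Stage 1: group on the wider key. Every (gender, user) group collapses
    # to a partial count of 1, no matter how often the pair occurs.
    partials = {}
    for user, gender in rows:
        partials[(gender, user)] = 1

    # Stage 2: group on gender and sum the small partial counts.
    counts = defaultdict(int)
    for (gender, _user), partial in partials.items():
        counts[gender] += partial
    return dict(counts)


# Hypothetical sample data: (user, gender) pairs with duplicates.
rows = [("alice", "F"), ("bob", "M"), ("alice", "F"), ("carol", "F"), ("bob", "M")]

single = count_distinct_single_stage(rows)
two_stage = count_distinct_two_stage(rows)
print(single, two_stage)
```

Stage 1 has one group per (gender, user) pair, so its work can spread over many reducers. Stage 2 still has only one group per gender, but it sums small partial counts instead of holding a bag of all user names.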

> optimize queries like - count distinct users for each gender
> ------------------------------------------------------------
>                 Key: PIG-1846
>                 URL: https://issues.apache.org/jira/browse/PIG-1846
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.9.0
>            Reporter: Thejas M Nair
>             Fix For: 0.10
> The pig group operation does not usually have to deal with skew on the group-by keys if the foreach statement that works on the results of group has only algebraic functions on the bags. But for some queries like the following, skew can be a problem -
> {code}
> user_data = load 'file' as (user, gender, age);
> user_group_gender = group user_data by gender parallel 100;
> dist_users_per_gender = foreach user_group_gender 
>                         { 
>                              dist_user = distinct user_data.user; 
>                              generate group as gender, COUNT(dist_user) as user_count;
>                         }
> {code}
> Since there are only 2 distinct values of the group-by key, only 2 reducers will actually get used in the current implementation, i.e., you can't get better performance by adding more reducers.
> A similar problem arises when the data is skewed on the group key. With the current implementation, another problem is that Pig and MR have to deal with records with extremely large bags that hold a large number of distinct user names, which results in high memory utilization and having to spill the bags to disk.
> The query plan should be modified to handle the skew in such cases and make use of more

This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
