Date: Wed, 2 Mar 2011 01:44:38 +0000 (UTC)
From: "Olga Natkovich (JIRA)"
To: pig-dev@hadoop.apache.org
Reply-To: dev@pig.apache.org
Message-ID: <674305287.6755.1299030278997.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <1223769532.3104.1297191777435.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] Updated: (PIG-1846) optimize queries like - count distinct users for each gender

[ 
https://issues.apache.org/jira/browse/PIG-1846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-1846:
--------------------------------

    Fix Version/s: 0.10

> optimize queries like - count distinct users for each gender
> ------------------------------------------------------------
>
>                 Key: PIG-1846
>                 URL: https://issues.apache.org/jira/browse/PIG-1846
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.9.0
>            Reporter: Thejas M Nair
>             Fix For: 0.10
>
>
> The Pig group operation usually does not have to deal with skew on the group-by keys if the foreach statement that works on the results of the group uses only algebraic functions on the bags. But for some queries, like the following, skew can be a problem:
> {code}
> user_data = load 'file' as (user, gender, age);
> user_group_gender = group user_data by gender parallel 100;
> dist_users_per_gender = foreach user_group_gender
>                         {
>                              dist_user = distinct user_data.user;
>                              generate group as gender, COUNT(dist_user) as user_count;
>                         }
> {code}
> Since there are only 2 distinct values of the group-by key, only 2 reducers will actually be used in the current implementation; i.e., you can't get better performance by adding more reducers.
> A similar problem arises when the data is skewed on the group key. With the current implementation, another problem is that Pig and MapReduce have to deal with records containing extremely large bags that hold the large number of distinct user names, which results in high memory utilization and having to spill the bags to disk.
> The query plan should be modified to handle the skew in such cases and make use of more reducers.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
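
To make the reducer-utilization argument in the issue description concrete, here is a minimal Python sketch that simulates hash partitioning across 100 reducers. It is not Pig's implementation and not the plan proposed for PIG-1846. The function names, the synthetic data, and the CRC32-based partitioner are all assumptions made for illustration. It compares grouping by gender alone with one possible two-phase plan that first deduplicates on (gender, user) and then sums small per-reducer partial counts.

```python
import zlib
from collections import defaultdict

NUM_REDUCERS = 100  # mirrors "parallel 100" in the quoted query


def partition(key, num_reducers=NUM_REDUCERS):
    """Deterministic stand-in for a hash partitioner (not Hadoop's actual one)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def make_records():
    """Synthetic (user, gender) rows: 1000 users, each appearing 3 times."""
    rows = []
    for i in range(3000):
        uid = i % 1000
        rows.append((f"u{uid}", "F" if uid % 2 == 0 else "M"))
    return rows


def count_single_phase(records):
    """Group by gender only: each gender's whole bag lands on one reducer."""
    bags = defaultdict(lambda: defaultdict(set))  # reducer -> gender -> users
    for user, gender in records:
        bags[partition(gender)][gender].add(user)
    counts = {}
    for per_reducer in bags.values():
        for gender, users in per_reducer.items():
            counts[gender] = len(users)
    return counts, len(bags)  # len(bags) = number of reducers that got data


def count_two_phase(records):
    """Phase 1 deduplicates on (gender, user); phase 2 sums partial counts."""
    seen = defaultdict(set)  # reducer -> distinct (gender, user) pairs
    for user, gender in records:
        seen[partition(f"{gender}\x00{user}")].add((gender, user))
    # A given pair always hashes to the same reducer, so the partial
    # counts never double-count a user and can simply be added up.
    counts = defaultdict(int)
    for pairs in seen.values():
        for gender, _user in pairs:
            counts[gender] += 1
    return dict(counts), len(seen)


if __name__ == "__main__":
    data = make_records()
    print("single phase:", count_single_phase(data))
    print("two phase:   ", count_two_phase(data))
```

Both functions return the same distinct-user counts, but the single-phase version can never keep more than two reducers busy, while the two-phase version spreads the deduplication work over many reducers and leaves only small per-gender sums for the final step.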