Date: Wed, 5 Sep 2012 18:02:08 +1100 (NCT)
From: "Dmitriy V. Ryaboy (JIRA)"
To: pig-dev@hadoop.apache.org
Reply-To: dev@pig.apache.org
Subject: [jira] [Commented] (PIG-2829) Use partial aggregation more aggresively

    [ https://issues.apache.org/jira/browse/PIG-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13448528#comment-13448528 ]

Dmitriy V. Ryaboy commented on PIG-2829:
----------------------------------------

Jie, sorry I missed this ticket before. As you may have seen, I completely reimplemented this whole chunk of code in PIG-2888. Can you rerun your benchmarks and see if some of the improvements you propose here should be applied to the code developed in that ticket?
> Use partial aggregation more aggresively
> ----------------------------------------
>
>                 Key: PIG-2829
>                 URL: https://issues.apache.org/jira/browse/PIG-2829
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.10.0
>            Reporter: Jie Li
>         Attachments: 2829.1.patch, 2829.2.patch, 2829.separate.options.patch, pigmix-10G.png, tpch-10G.png
>
>
> Partial aggregation (hash aggregation, also known as an in-map combiner) is a new feature in Pig 0.10 that performs aggregation within the map function. Its main advantages over the combiner are that it avoids serializing, deserializing, and sorting the data, and that it can disable itself automatically if the data reduction rate is low. It is currently disabled by default.
> To use PartialAgg more aggressively, several things need to be revisited:
> 1. The threshold for auto-disabling. Currently each mapper looks at the first 1k records (hard-coded) to see whether the data size is reduced enough (the default is 10x, configurable). The check happens earlier if the hash table fills up before 1k records have been processed (the hash table size is controlled by pig.cachedbag.memusage). We might want to relax these thresholds.
> 2. The dependency on the combiner. Currently PartialAgg won't work without a combiner following it, so we need to provide separate options to enable each independently.
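
For readers unfamiliar with the auto-disable check in item 1 of the quoted description, here is a minimal sketch of the idea in Python. It is not Pig's implementation: the function name `map_with_partial_agg` and the parameters `check_after`, `min_reduction`, and `max_keys` are hypothetical, the aggregate is a plain sum, the reduction ratio is measured in record counts rather than data size, and a key-count cap stands in for the memory-based hash table limit.

```python
def map_with_partial_agg(records, check_after=1000, min_reduction=10, max_keys=None):
    """Yield (key, partial_sum) pairs from an iterable of (key, value) records.

    Hypothetical sketch of in-map hash aggregation with auto-disable.
    Assumptions (not taken from Pig's source): reduction is measured as
    records seen / distinct keys, and max_keys approximates a full hash table.
    """
    table = {}       # in-map hash table: key -> running sum
    seen = 0         # records aggregated so far
    enabled = True   # switched off if the reduction is too low
    checked = False  # the reduction check runs only once

    for key, value in records:
        if not enabled:
            # Aggregation is off: pass the record through untouched.
            yield key, value
            continue

        seen += 1
        table[key] = table.get(key, 0) + value
        table_full = max_keys is not None and len(table) >= max_keys

        # Check once, after check_after records or earlier if the table fills.
        if not checked and (seen >= check_after or table_full):
            checked = True
            if seen / len(table) < min_reduction:
                # Not enough reduction: flush what we have and disable.
                enabled = False
                yield from table.items()
                table = {}
                continue

        if table_full:
            # Table is full but aggregation stays on: flush and start over.
            yield from table.items()
            table = {}

    # Emit whatever is still buffered at the end of the input.
    yield from table.items()
```

With the defaults, 2000 records sharing one key collapse into a single output pair, while 2000 records with distinct keys trip the check at record 1000 and the rest pass through unaggregated. Lowering `min_reduction` or raising `check_after` corresponds to the "relax these thresholds" suggestion in the description.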