Message-ID: <167567118.1257897628647.JavaMail.jira@brutus>
Date: Wed, 11 Nov 2009 00:00:28 +0000 (UTC)
From: "Viraj Bhat (JIRA)"
To: pig-dev@hadoop.apache.org
Reply-To: pig-dev@hadoop.apache.org
Subject: [jira] Created: (PIG-1081) PigCookBook use of PARALLEL keyword

PigCookBook use of PARALLEL keyword
-----------------------------------

                 Key: PIG-1081
                 URL: https://issues.apache.org/jira/browse/PIG-1081
             Project: Pig
          Issue Type: Bug
          Components: documentation
    Affects Versions: 0.5.0
            Reporter: Viraj Bhat
             Fix For: 0.5.0

Hi all,

I am looking at some tips for
optimizing Pig programs (Pig Cookbook) using the PARALLEL keyword:

http://hadoop.apache.org/pig/docs/r0.5.0/cookbook.html#Use+PARALLEL+Keyword

We know that Pig 0.5 currently uses Hadoop 0.20 as its default, which launches 1 reducer in all cases. In this documentation we state that:

* * 0.9

This documentation was valid for HoD (Hadoop on Demand), where you create your own Hadoop clusters. But if you are using either the Capacity Scheduler

http://hadoop.apache.org/common/docs/current/capacity_scheduler.html

or the Fair Share Scheduler

http://hadoop.apache.org/common/docs/current/fair_scheduler.html

these numbers could mean that you are using around 90% of your reducer slots.

We should change this to something like:

The number of reducers you may need for a particular Pig construct that forms a Map Reduce boundary depends entirely on your data and the number of intermediate keys you are generating in your mappers. In the best cases, we have seen that a reducer processing about 500 MB of data behaves efficiently. Additionally, it is hard to define the optimum number of reducers, since it depends completely on the partitioner and the distribution of map (combiner) output keys.

Viraj

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
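
As a rough illustration of the "about 500 MB per reducer" rule of thumb proposed in the issue text, a reducer count could be estimated from the expected reducer input size. This is only a sketch under stated assumptions: the function name `estimate_reducers`, the `max_reducers` cap, and the use of 500 MiB as the target are hypothetical choices made for illustration, not part of Pig or Hadoop, and the estimate ignores key skew and the partitioner, which the issue notes can dominate in practice.

```python
# Assumption: ~500 MB of input per reducer, taken from the issue text as a
# rough target (interpreted here as 500 MiB), not a hard limit.
TARGET_BYTES_PER_REDUCER = 500 * 1024 * 1024


def estimate_reducers(reduce_input_bytes, max_reducers=None):
    """Return a rough reducer count for the given reducer input size.

    reduce_input_bytes: estimated bytes of map (combiner) output reaching
        the reducers; this has to be measured or guessed by the user.
    max_reducers: optional hypothetical cap, e.g. the share of reduce slots
        you are willing to occupy on a shared cluster.
    """
    if reduce_input_bytes < 0:
        raise ValueError("reduce_input_bytes must be non-negative")

    # Integer ceiling division, with a floor of one reducer.
    count = (reduce_input_bytes + TARGET_BYTES_PER_REDUCER - 1) // TARGET_BYTES_PER_REDUCER
    count = max(1, count)

    # Apply the optional cap so a large job does not claim every slot.
    if max_reducers is not None:
        count = min(count, max_reducers)
    return count
```

The resulting number could then be supplied by hand to the PARALLEL clause of the operator that forms the Map Reduce boundary, for example `B = GROUP A BY key PARALLEL 21;`. Because the estimate assumes evenly distributed keys, it is a starting point to tune rather than an optimum.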