From: "David Ciemiewicz (JIRA)"
To: pig-dev@hadoop.apache.org
Date: Thu, 9 Apr 2009 13:40:12 -0700 (PDT)
Subject: [jira] Commented: (PIG-729) Use of default parallelism

    [ https://issues.apache.org/jira/browse/PIG-729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12697641#action_12697641 ]

David Ciemiewicz commented on PIG-729:
--------------------------------------

Ah wait, I just read what Olga wrote again.

I think there might be a hybrid solution that handles both cases without having to use -param.

We should add to Pig a -set option that lets us set values for things that we would "set" in our scripts.

    pig -set parallelism=5

is equivalent to the following idiom in my Pig script:

    set parallelism 5;

Command-line -set options should override explicit set statements in the Pig script, with a warning about the override.

I think this generalized mechanism would satisfy both my desires as a developer and Olga's desire to reduce the Pig development team's code maintenance headaches.

> Use of default parallelism
> --------------------------
>
>                 Key: PIG-729
>                 URL: https://issues.apache.org/jira/browse/PIG-729
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.2.1
>         Environment: Hadoop 0.20
>            Reporter: Santhosh Srinivasan
>             Fix For: 0.2.1
>
>
> Currently, if the user does not specify the number of reduce slots using the parallel keyword, Pig lets Hadoop decide on the default number of reducers. This model worked well with dynamically allocated clusters using HOD and for static clusters where the default number of reduce slots was explicitly set. With Hadoop 0.20, a single static cluster will be shared amongst a number of queues. As a result, a common scenario is to end up with the default number of reducers set to one (1).
> When users migrate to Hadoop 0.20, they might see a dramatic change in the performance of their queries if they had not used the parallel keyword to specify the number of reducers. In order to mitigate such circumstances, Pig can support one of the following:
> 1. Specify a default parallelism for the entire script.
> This option will allow users to use the same parallelism for all operators that do not have the explicit parallel keyword. This will ensure that the scripts utilize more reducers than the default of one reducer.
> On the downside, because of data transformations, operations performed towards the end of the script will usually need a smaller number of reducers than the operators that appear at the beginning of the script.
> 2. Display a warning message for each reduce-side operator that does not use the explicit parallel keyword. Proceed with the execution.
> 3. Display an error message indicating the operator that does not use the explicit parallel keyword. Stop the execution.
> Other suggestions/thoughts/solutions are welcome.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
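
For illustration only, here is a rough sketch of the override rule proposed in the comment above: a command-line -set value wins over a "set" statement in the script, and the override is reported with a warning. This is not Pig code. The -set option is only a proposal in this thread, and the function names (parse_script_sets, parse_cli_sets, resolve_settings) and the simplified statement parsing are assumptions made up for the example.

```python
import warnings


def parse_script_sets(script):
    """Collect `set key value;` statements from a script.

    Simplified, hypothetical parser: statements are split on ';' and only
    three-token statements starting with 'set' are recognized.
    """
    settings = {}
    for statement in script.split(";"):
        parts = statement.split()
        if len(parts) == 3 and parts[0] == "set":
            settings[parts[1]] = parts[2]
    return settings


def parse_cli_sets(argv):
    """Collect `-set key=value` pairs from a command line (hypothetical flag)."""
    settings = {}
    i = 0
    while i < len(argv):
        if argv[i] == "-set" and i + 1 < len(argv) and "=" in argv[i + 1]:
            key, value = argv[i + 1].split("=", 1)
            settings[key] = value
            i += 2
        else:
            i += 1
    return settings


def resolve_settings(script, argv):
    """Merge both sources; command-line values override script values."""
    merged = parse_script_sets(script)
    for key, value in parse_cli_sets(argv).items():
        if key in merged:
            # The proposal asks for a warning whenever an override happens.
            warnings.warn(
                "-set %s=%s overrides 'set %s %s;' in the script"
                % (key, value, key, merged[key])
            )
        merged[key] = value
    return merged
```

With this sketch, resolve_settings("set parallelism 5;", ["-set", "parallelism=10"]) yields a parallelism of "10" and emits one warning, while a script with no matching set statement simply picks up the command-line value without any warning.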