Mailing-List: contact pig-dev-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: pig-dev@hadoop.apache.org
Message-ID: <167567118.1257897628647.JavaMail.jira@brutus>
Date: Wed, 11 Nov 2009 00:00:28 +0000 (UTC)
From: "Viraj Bhat (JIRA)" <jira@apache.org>
To: pig-dev@hadoop.apache.org
Subject: [jira] Created: (PIG-1081) PigCookBook use of PARALLEL keyword
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

PigCookBook use of PARALLEL keyword
-----------------------------------

                 Key: PIG-1081
                 URL: https://issues.apache.org/jira/browse/PIG-1081
             Project: Pig
          Issue Type: Bug
          Components: documentation
    Affects Versions: 0.5.0
            Reporter: Viraj Bhat
             Fix For: 0.5.0


Hi all,
I am looking at the tips for optimizing Pig programs in the Pig Cookbook, specifically the section on using the PARALLEL keyword:

http://hadoop.apache.org/pig/docs/r0.5.0/cookbook.html#Use+PARALLEL+Keyword 
Pig 0.5 currently uses Hadoop 0.20 as its default, which launches 1 reducer by default (that is, when PARALLEL is not specified).

In this documentation we recommend setting PARALLEL to:

<num machines> * <num reduce slots per machine> * 0.9

This advice was valid for HoD (Hadoop on Demand), where you create your own Hadoop cluster. However, if you are using either the Capacity Scheduler http://hadoop.apache.org/common/docs/current/capacity_scheduler.html or the Fair Share Scheduler http://hadoop.apache.org/common/docs/current/fair_scheduler.html , these numbers could mean that you are using around 90% of the reducer slots on the shared cluster.
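
As a rough illustration of what the cookbook formula yields, here is a small Python sketch. The cluster size (100 machines with 2 reduce slots each) is a made-up example, and the function name is only for illustration; it is not part of Pig or Hadoop.

```python
def cookbook_parallel(num_machines, reduce_slots_per_machine, percent=90):
    """PARALLEL value suggested by the current cookbook formula.

    Integer arithmetic is used so the result is not affected by
    floating-point rounding of 0.9.
    """
    total_slots = num_machines * reduce_slots_per_machine
    return total_slots * percent // 100


# Hypothetical cluster: 100 machines, 2 reduce slots per machine.
total_slots = 100 * 2
parallel = cookbook_parallel(100, 2)
print(f"PARALLEL {parallel} would request {parallel} of {total_slots} reduce slots")
```

On a dedicated HoD cluster that may be reasonable, but on a cluster shared through a scheduler a single job would be asking for nearly every reduce slot.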

We should change this to something like: 
The number of reducers you need for a particular Pig construct that forms a Map Reduce boundary depends entirely on your data and on the number of intermediate keys your mappers generate. In the best cases we have seen, a reducer processing about 500 MB of data behaves efficiently. It is also hard to define the optimum number of reducers, since it depends completely on the partitioner and on the distribution of the map (combiner) output keys.
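
A minimal sketch of how the 500 MB guideline could be turned into a starting value for PARALLEL, written in Python. It assumes you already have an estimate of the bytes reaching the reducers and that keys are spread evenly across them. Neither assumption is guaranteed, so treat the result as a first guess rather than an optimum. The function and constant names are hypothetical.

```python
# ~500 MB per reducer, the figure quoted in the proposed text.
TARGET_BYTES_PER_REDUCER = 500 * 1024 * 1024


def estimate_reducers(reduce_input_bytes, target=TARGET_BYTES_PER_REDUCER):
    """Rough PARALLEL estimate: ceiling of input size / target, at least 1.

    Assumes an even key distribution; skewed keys or a custom
    partitioner can make individual reducers much heavier.
    """
    if reduce_input_bytes <= 0:
        return 1
    return (reduce_input_bytes + target - 1) // target


# Example: about 10 GB of map (combiner) output.
print(estimate_reducers(10 * 1024 ** 3))
```

With heavily skewed keys, a few reducers would still receive far more than 500 MB regardless of the value chosen, which is why the figure can only be a guideline.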

Viraj


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.