pig-dev mailing list archives

From "Viraj Bhat (JIRA)" <j...@apache.org>
Subject [jira] Created: (PIG-1081) PigCookBook use of PARALLEL keyword
Date Wed, 11 Nov 2009 00:00:28 GMT
PigCookBook use of PARALLEL keyword

                 Key: PIG-1081
                 URL: https://issues.apache.org/jira/browse/PIG-1081
             Project: Pig
          Issue Type: Bug
          Components: documentation
    Affects Versions: 0.5.0
            Reporter: Viraj Bhat
             Fix For: 0.5.0

Hi all,
I am looking at the tips for optimizing Pig programs (Pig Cookbook) that use the PARALLEL keyword.

We know that Pig 0.5 currently uses Hadoop 0.20 by default, which launches 1 reducer in
all cases.

In this documentation we recommend: <num machines> * <num reduce slots per machine>
* 0.9. This advice was valid for HoD (Hadoop on Demand), where you create your
own Hadoop cluster. But if you are using either:

the Capacity Scheduler http://hadoop.apache.org/common/docs/current/capacity_scheduler.html
or the Fair Share Scheduler http://hadoop.apache.org/common/docs/current/fair_scheduler.html
then these numbers could mean that you are using around 90% of the reducer slots in a shared cluster.
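
To make the concern concrete, here is a rough sketch of the arithmetic behind the current
cookbook rule. The function name, the rounding down, and the cluster sizes are my own
assumptions for illustration; the cookbook only gives the product itself.

```python
import math


def cookbook_parallel(num_machines: int, reduce_slots_per_machine: int,
                      fraction: float = 0.9) -> int:
    """Reducer count suggested by the current cookbook rule.

    Rounding down and the floor of 1 are assumptions; the rule only
    states <num machines> * <num reduce slots per machine> * 0.9.
    """
    total_slots = num_machines * reduce_slots_per_machine
    return max(1, math.floor(total_slots * fraction))


# Hypothetical cluster: 100 machines with 2 reduce slots each.
print(cookbook_parallel(100, 2))  # 180 of the 200 reduce slots
```

On a dedicated HoD cluster that number is reasonable, but on a cluster shared through a
scheduler a single job asking for 180 of 200 slots leaves very little for anyone else.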

We should change this to something like:
The number of reducers you need for a particular Pig construct that forms a MapReduce
boundary depends entirely on your data and on the number of intermediate keys your mappers
generate. In the best cases we have seen, a reducer processing about 500 MB of data
behaves efficiently. Beyond that, it is hard to define the optimum number of reducers, since
it depends completely on the partitioner and on the distribution of map (combiner) output keys.
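
As a rough illustration of the proposed wording, the 500 MB figure can be turned into a
starting value for PARALLEL. This is only a sketch: the function name is hypothetical, it
assumes you can estimate the size of the data reaching the reducers, and it assumes the keys
are spread evenly, which the partitioner and key distribution may not give you.

```python
MB = 1024 * 1024


def estimate_reducers(reduce_input_bytes: int,
                      bytes_per_reducer: int = 500 * MB) -> int:
    """Starting guess for PARALLEL from the ~500 MB per reducer observation.

    Assumes an even key distribution; skewed keys will still overload
    individual reducers no matter what value is chosen.
    """
    # Ceiling division in integer arithmetic, with at least one reducer.
    return max(1, -(-reduce_input_bytes // bytes_per_reducer))


# Hypothetical job whose map (combiner) output is about 10 GB.
print(estimate_reducers(10 * 1024 * MB))  # 21
```

The result would then be used in the script, for example GROUP ... BY ... PARALLEL 21,
and adjusted after looking at how evenly the reducers were actually loaded.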


This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.