hadoop-pig-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Corinne Chandel (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1081) PigCookBook use of PARALLEL keyword
Date Thu, 03 Dec 2009 23:26:21 GMT

    [ https://issues.apache.org/jira/browse/PIG-1081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785626#action_12785626
] 

Corinne Chandel commented on PIG-1081:
--------------------------------------

Discussed with Viraj.

Documentation changes made.

Changes included in pig-6.patch attached to PIG-1084.

https://issues.apache.org/jira/secure/ManageAttachments.jspa?id=12440363

> PigCookBook use of PARALLEL keyword
> -----------------------------------
>
>                 Key: PIG-1081
>                 URL: https://issues.apache.org/jira/browse/PIG-1081
>             Project: Pig
>          Issue Type: Bug
>          Components: documentation
>    Affects Versions: 0.5.0
>            Reporter: Viraj Bhat
>             Fix For: 0.5.0
>
>
> Hi all,
>  I am looking at some tips for optimizing Pig programs (Pig Cookbook) using the PARALLEL
keyword.
> http://hadoop.apache.org/pig/docs/r0.5.0/cookbook.html#Use+PARALLEL+Keyword 
> We know that currently Pig 0.5 uses Hadoop 20 (as its default) which launches 1 reducer
for all cases. 
> In this documentation we state that: <num machines> * <num reduce slots per
machine> * 0.9, this documentation was valid for HoD (Hadoop on Demand) where you are creating
your own Hadoop clusters, but if you are using:
> Either the Capacity Scheduler http://hadoop.apache.org/common/docs/current/capacity_scheduler.html
or the Fair Share Scheduler http://hadoop.apache.org/common/docs/current/fair_scheduler.html
, these numbers could mean that you are using around 90% of your reducer slots in your machine.
> We should change this to something like: 
> The number of reducers you may need for a particular construct in Pig which forms a Map
Reduce boundary depends entirely on your data and the number of intermediate keys you are
generating in your mappers. In best cases we have seen that a reducer processing about 500
MB of data behaves efficiently. Additionally it is hard to define the optimum number of reducers,
since it completely depends on the paritioner and distribution of map (combiner) output keys.
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message