hadoop-pig-dev mailing list archives

From "David Ciemiewicz (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-729) Use of default parallelism
Date Fri, 03 Apr 2009 21:37:13 GMT

    [ https://issues.apache.org/jira/browse/PIG-729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695599#action_12695599 ]

David Ciemiewicz commented on PIG-729:

I've been through this battle before.  And I write LOTS of Pig scripts.

Here's what I want:

1) Use a default parallelism of 1 reducer.  BUT WARN ME that I've got a default parallelism
of 1 reducer. (I'd actually prefer whatever works on a single node).

2) Allow me a command line option such as -parallel # or -mappers # -reducers #.

3) Allow me a set parameter inside my Pig scripts such as:

    set parallel #
    set mappers #
    set reducers #

4) DO NOT require me to add a PARALLEL clause to each and every one of my reducer statements.
PARALLEL clauses are a code maintenance nightmare. 
Sometimes the grid is fat on available nodes and so I want to take advantage of this and run
my job across as many nodes as possible.
Sometimes the grid is scarce on available nodes and so I want to back off on the parallelism.

I DO NOT WANT to change EVERY PARALLEL clause in my code each time I run my script.
I DO NOT WANT to change parameter values for the PARALLEL clause each time I run my script.

I really, really, really want to make this a run-time decision on the execution of the script
at the time that I invoke the script and I want this to be the default behavior in Pig.
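
A minimal Python sketch of the run-time decision described in points 1 to 3, not Pig code. The helper `effective_reducers` and its parameter names are hypothetical, and the precedence (command-line value first, then the script-level `set`, then a default of 1 with a warning) is an assumption; the comment does not say which setting should win.

```python
import warnings

DEFAULT_REDUCERS = 1  # point 1: fall back to a single reducer


def effective_reducers(cli_parallel=None, script_parallel=None):
    """Pick the reducer count for one run (hypothetical helper, not a Pig API)."""
    # Assumed precedence: a command-line value beats a script-level "set".
    for value in (cli_parallel, script_parallel):
        if value is not None:
            if value < 1:
                raise ValueError("parallelism must be at least 1")
            return value
    # Nothing was specified anywhere: use the default, but say so loudly.
    warnings.warn("no parallelism specified; using a default of 1 reducer")
    return DEFAULT_REDUCERS
```

With this shape, the script itself never changes between a busy grid and an idle one; only the value passed at invocation does.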

> Use of default parallelism
> --------------------------
>                 Key: PIG-729
>                 URL: https://issues.apache.org/jira/browse/PIG-729
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.2.1
>         Environment: Hadoop 0.20
>            Reporter: Santhosh Srinivasan
>             Fix For: 0.2.1
> Currently, if the user does not specify the number of reduce slots using the parallel
keyword, Pig lets Hadoop decide on the default number of reducers. This model worked well
with dynamically allocated clusters using HOD and for static clusters where the default number
of reduce slots was explicitly set. With Hadoop 0.20, a single static cluster will be shared
amongst a number of queues. As a result, a common scenario is to end up with the default number
of reducers set to one (1).
> When users migrate to Hadoop 0.20, they might see a dramatic change in the performance
of their queries if they had not used the parallel keyword to specify the number of reducers.
In order to mitigate such circumstances, Pig can support one of the following:
> 1. Specify a default parallelism for the entire script.
> This option will allow users to use the same parallelism for all operators that do not
have the explicit parallel keyword. This will ensure that the scripts utilize more reducers
than the default of one reducer. On the down side, due to data transformations, usually operations
that are performed towards the end of the script will need a smaller number of reducers compared
to the operators that appear at the beginning of the script.
> 2. Display a warning message for each reduce side operator that does not have the
explicit parallel keyword. Proceed with the execution.
> 3. Display an error message indicating the operator that does not have the explicit use
of the parallel keyword. Stop the execution.
> Other suggestions/thoughts/solutions are welcome.
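
The three options above can be compared with a small Python sketch of how a planner might treat a reduce-side operator that lacks an explicit parallel keyword. This is an illustration under assumptions, not Pig's implementation: `reducers_for_operator`, the policy names `"default"`, `"warn"` and `"error"`, and the `hadoop_default` parameter are all hypothetical.

```python
import warnings


def reducers_for_operator(name, explicit=None, policy="warn",
                          script_default=None, hadoop_default=1):
    """Return the reducer count for one reduce-side operator (hypothetical)."""
    # An explicit parallel keyword on the operator is always honored.
    if explicit is not None:
        return explicit
    if policy == "default":
        # Option 1: one script-wide parallelism for unannotated operators.
        if script_default is None:
            raise ValueError("policy 'default' needs a script_default")
        return script_default
    if policy == "warn":
        # Option 2: warn, then proceed with whatever Hadoop would use.
        warnings.warn(f"{name}: no parallel keyword; using {hadoop_default}")
        return hadoop_default
    if policy == "error":
        # Option 3: name the offending operator and stop.
        raise RuntimeError(f"{name}: missing explicit parallel keyword")
    raise ValueError(f"unknown policy: {policy}")
```

Option 1 trades precision for convenience: every unannotated operator gets the same count, even where later stages of the script would need fewer reducers.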

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
