pig-dev mailing list archives

From "Olga Natkovich (JIRA)" <j...@apache.org>
Subject [jira] Updated: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword
Date Fri, 20 Aug 2010 21:11:17 GMT

     [ https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-1249:

    Release Note: 
In previous versions of Pig, if the number of reducers was not specified (via PARALLEL
or default_parallel), a value of 1 was used, which in many cases was a poor choice and
caused severe performance problems.

In Pig 0.8.0, a simple heuristic is used to come up with a better number based on the size
of the input data. The user can control two parameters:

pig.exec.reducers.bytes.per.reducer - defines the number of input bytes per reducer; default
value is 1000*1000*1000 (1GB)
pig.exec.reducers.max - defines the upper bound on the number of reducers; default is 999

The formula is very simple:

#reducers = MIN (pig.exec.reducers.max, total input size (in bytes) / bytes per reducer)

This formula is deliberately simplistic and will need to be improved over time. Note that the
value is computed from the combined size of all inputs in the script and is applied to every
job within the Pig script.
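
To illustrate the heuristic, here is a minimal Python sketch of the formula using the two
default values quoted in this note. It is not Pig's actual implementation: the function name
estimate_reducers is hypothetical, and rounding the quotient up and using at least one
reducer are assumptions, since the note does not say how fractional results are handled.

```python
# Defaults as stated in the release note.
DEFAULT_BYTES_PER_REDUCER = 1000 * 1000 * 1000  # pig.exec.reducers.bytes.per.reducer (1GB)
DEFAULT_MAX_REDUCERS = 999                      # pig.exec.reducers.max


def estimate_reducers(total_input_bytes,
                      bytes_per_reducer=DEFAULT_BYTES_PER_REDUCER,
                      max_reducers=DEFAULT_MAX_REDUCERS):
    """Hypothetical helper: MIN(max_reducers, total input size / bytes per reducer)."""
    if total_input_bytes < 0:
        raise ValueError("total_input_bytes must be non-negative")
    if bytes_per_reducer <= 0:
        raise ValueError("bytes_per_reducer must be positive")
    # Assumption: round up (integer ceiling) and never go below one reducer.
    by_size = max(1, (total_input_bytes + bytes_per_reducer - 1) // bytes_per_reducer)
    return min(max_reducers, by_size)


if __name__ == "__main__":
    print(estimate_reducers(250 * 10**9))   # 250GB of input
    print(estimate_reducers(10 * 10**12))   # 10TB of input, hits the upper bound
```

With the defaults, 250GB of total input yields 250 reducers, while 10TB of input would call for
10000 and is therefore capped at 999.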

> Safe-guards against misconfigured Pig scripts without PARALLEL keyword
> ----------------------------------------------------------------------
>                 Key: PIG-1249
>                 URL: https://issues.apache.org/jira/browse/PIG-1249
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.8.0
>            Reporter: Arun C Murthy
>            Assignee: Jeff Zhang
>            Priority: Critical
>             Fix For: 0.8.0
>         Attachments: PIG-1249-4.patch, PIG-1249.patch, PIG-1249_5.patch, PIG_1249_2.patch,
> It would be *very* useful for Pig to have safe-guards against naive scripts which process
> a *lot* of data without the use of PARALLEL keyword.
> We've seen a fair number of instances where naive users process huge data-sets (>10TB)
> with badly mis-configured #reduces e.g. 1 reduce.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
