hadoop-pig-dev mailing list archives

From "Jeff Zhang (JIRA)" <j...@apache.org>
Subject [jira] Updated: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword
Date Mon, 17 May 2010 16:00:50 GMT

     [ https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Zhang updated PIG-1249:

    Attachment: PIG-1249.patch

The current idea is borrowed from Hive: use the input file size to estimate the number of reducers.
Two parameters can be set for this purpose:
pig.exec.reducers.bytes.per.reducer  // the number of bytes of input per reducer
pig.exec.reducers.max                // the maximum number of reducers

This only works for HDFS; it won't work for other data sources such as HBase or Cassandra.
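To illustrate the heuristic described above, here is a minimal sketch (not the attached patch itself) of how the two proposed settings could drive the reducer count: divide the total input bytes by the configured bytes-per-reducer, round up, and cap at the configured maximum. The class name, method name, and default values below are assumptions for illustration only.

```java
// Illustrative sketch of the size-based reducer estimation, mirroring the
// proposed pig.exec.reducers.bytes.per.reducer and pig.exec.reducers.max
// settings. Names and defaults here are hypothetical, not from the patch.
public class ReducerEstimator {

    // Hypothetical defaults chosen only for this example.
    static final long DEFAULT_BYTES_PER_REDUCER = 1_000_000_000L; // 1 GB
    static final int DEFAULT_MAX_REDUCERS = 999;

    static int estimateReducers(long totalInputBytes,
                                long bytesPerReducer,
                                int maxReducers) {
        // ceil(totalInputBytes / bytesPerReducer), at least 1 reducer,
        // capped at the configured maximum.
        long estimate = (totalInputBytes + bytesPerReducer - 1) / bytesPerReducer;
        return (int) Math.min(Math.max(estimate, 1L), maxReducers);
    }
}
```

With these defaults, a 10 GB input would get 10 reducers, while a 10 TB input would be capped at 999 instead of spilling everything into a single mis-configured reducer.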

> Safe-guards against misconfigured Pig scripts without PARALLEL keyword
> ----------------------------------------------------------------------
>                 Key: PIG-1249
>                 URL: https://issues.apache.org/jira/browse/PIG-1249
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Arun C Murthy
>            Assignee: Jeff Zhang
>            Priority: Critical
>             Fix For: 0.8.0
>         Attachments: PIG-1249.patch
> It would be *very* useful for Pig to have safe-guards against naive scripts which process
> a *lot* of data without the use of PARALLEL keyword.
> We've seen a fair number of instances where naive users process huge data-sets (>10TB)
> with badly mis-configured #reduces e.g. 1 reduce.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
