hadoop-pig-dev mailing list archives

From "Olga Natkovich (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword
Date Thu, 29 Jul 2010 00:29:17 GMT

    [ https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893446#action_12893446 ]

Olga Natkovich commented on PIG-1249:
-------------------------------------

Comments for the documentation:

+    /**
+     * Currently the estimation of the number of reducers is only applied to HDFS. The estimation is based on the size of the input data stored on HDFS.
+     * Two parameters can be configured for the estimation. One is pig.exec.reducers.max, which constrains the maximum number of reducer tasks (default is 999). The other
+     * is pig.exec.reducers.bytes.per.reducer (default value is 1000*1000*1000), which specifies how much data each reducer can handle.
+     * For example, given the following Pig script:
+     * a = load '/data/a';
+     * b = load '/data/b';
+     * c = join a by $0, b by $0;
+     * store c into '/tmp';
+     *
+     * The size of /data/a is 1000*1000*1000, and the size of /data/b is 2*1000*1000*1000.
+     * Then the estimated reducer number is (1000*1000*1000+2*1000*1000*1000)/(1000*1000*1000)=3
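
For illustration only, the arithmetic described in the quoted comment can be sketched in Java as below. This is not the code from the patch: the class and method names are hypothetical, and rounding up, the floor of one reducer, and the order in which the cap is applied are assumptions that the quoted comment does not state. Only the two defaults (999 and 1000*1000*1000) and the example sizes come from the comment itself.

```java
// Hypothetical sketch of the estimate described above; not the PIG-1249 patch code.
final class ReducerEstimateSketch {
    // Defaults quoted in the Javadoc for pig.exec.reducers.max and
    // pig.exec.reducers.bytes.per.reducer.
    static final int DEFAULT_MAX_REDUCERS = 999;
    static final long DEFAULT_BYTES_PER_REDUCER = 1000L * 1000 * 1000;

    static int estimate(long totalInputBytes, long bytesPerReducer, int maxReducers) {
        if (totalInputBytes < 0 || bytesPerReducer <= 0 || maxReducers <= 0) {
            throw new IllegalArgumentException("sizes and limits must be positive");
        }
        // Assumption: round up so a partial chunk still gets a reducer.
        long reducers = totalInputBytes / bytesPerReducer
                + (totalInputBytes % bytesPerReducer == 0 ? 0 : 1);
        // Assumption: never fewer than one reducer, never more than the cap.
        reducers = Math.max(1L, reducers);
        return (int) Math.min((long) maxReducers, reducers);
    }

    public static void main(String[] args) {
        long sizeA = 1000L * 1000 * 1000;      // size of /data/a in the example
        long sizeB = 2L * 1000 * 1000 * 1000;  // size of /data/b in the example
        // Prints 3, matching the worked example in the comment.
        System.out.println(estimate(sizeA + sizeB,
                DEFAULT_BYTES_PER_REDUCER, DEFAULT_MAX_REDUCERS));
    }
}
```

Under these assumptions, a 10 TB input with the default settings would be capped at 999 reducers rather than the 10000 the division alone would give.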


> Safe-guards against misconfigured Pig scripts without PARALLEL keyword
> ----------------------------------------------------------------------
>
>                 Key: PIG-1249
>                 URL: https://issues.apache.org/jira/browse/PIG-1249
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.8.0
>            Reporter: Arun C Murthy
>            Assignee: Jeff Zhang
>            Priority: Critical
>             Fix For: 0.8.0
>
>         Attachments: PIG-1249-4.patch, PIG-1249.patch, PIG-1249_5.patch, PIG_1249_2.patch, PIG_1249_3.patch
>
>
> It would be *very* useful for Pig to have safe-guards against naive scripts which process a *lot* of data without the use of PARALLEL keyword.
> We've seen a fair number of instances where naive users process huge data-sets (>10TB) with badly mis-configured #reduces e.g. 1 reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

