hadoop-pig-dev mailing list archives

From "Alan Gates (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword
Date Tue, 25 May 2010 20:59:33 GMT

    [ https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871358#action_12871358 ]

Alan Gates commented on PIG-1249:
---------------------------------

1. In this code, what happens if a loader is not loading from a file (like an HBase loader)?
It looks to me like it will end up throwing an IOException when it tries to stat the 'file',
which won't exist, and that will cause Pig to die. Ideally in this case it should decide that
it cannot make a rational estimate and not try to estimate.

{color:blue}
It won't throw an IOException when the file doesn't exist; getTotalInputFileSize will return
0 if it is not loading from a file or the file doesn't exist, and the final estimated reducer
number will be 1.

{color}
{color:red}
Could we add a test to cover this?  I think it would be good to ensure it works in this situation.
 Maybe you could take one of the tests that uses the HBase loader.
{color}

2. I'm curious where the values of ~1GB per reducer and 999 reducers came from.

{color:blue}
These two numbers are what Hive uses; I'm not sure where they came from. Maybe from their experience.
{color}
{color:red}
OK, good enough.  We can adjust them later if we need to.
{color}

3. Does this estimate apply only to the first job or to all jobs?

{color:blue}
It will apply to all the jobs.
{color}
{color:red}
Eventually we should change this to do the estimation on the fly in the JobControlCompiler.
 Since most queries tend to aggregate data down after a number of steps, I suspect that using
the initial input to estimate the entire query will mean that the final results are parallelized
too widely.  But this is better than the current situation, where they aren't parallelized
at all.
{color}

4. How does this work in the case of joins, where there are multiple inputs to a job?

{color:blue}
It will estimate the reducer number according to the total size of all the input files.
{color}
{color:red}
cool
{color}
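
Taken together, the behavior discussed in points 1, 2, and 4 above amounts to something
like the sketch below. This is only an illustration of the estimation logic; the class,
method, and constant names are hypothetical, not the actual patch code:

```java
// Minimal sketch of reducer estimation as described in this thread.
public class ReducerEstimator {
    // ~1GB per reducer and a cap of 999 reducers, the same defaults Hive uses.
    static final long BYTES_PER_REDUCER = 1000000000L;
    static final int MAX_REDUCERS = 999;

    // totalInputFileSize is the summed size of all input files for the job
    // (point 4). A loader that does not read from files (e.g. an HBase
    // loader) contributes 0, so the estimate falls back to 1 reducer
    // (point 1).
    static int estimateReducers(long totalInputFileSize) {
        int estimated = (int) Math.ceil((double) totalInputFileSize / BYTES_PER_REDUCER);
        return Math.max(1, Math.min(MAX_REDUCERS, estimated));
    }
}
```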

So other than testing the non-file case I'm +1 on this patch.


> Safe-guards against misconfigured Pig scripts without PARALLEL keyword
> ----------------------------------------------------------------------
>
>                 Key: PIG-1249
>                 URL: https://issues.apache.org/jira/browse/PIG-1249
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.8.0
>            Reporter: Arun C Murthy
>            Assignee: Jeff Zhang
>            Priority: Critical
>             Fix For: 0.8.0
>
>         Attachments: PIG-1249.patch, PIG_1249_2.patch
>
>
> It would be *very* useful for Pig to have safe-guards against naive scripts which process
> a *lot* of data without the use of the PARALLEL keyword.
> We've seen a fair number of instances where naive users process huge data-sets (>10TB)
> with badly mis-configured #reduces, e.g. 1 reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
