hadoop-pig-dev mailing list archives

From "Daniel Dai (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1301) Problem pruning columns with UDF
Date Tue, 16 Mar 2010 19:36:27 GMT

    [ https://issues.apache.org/jira/browse/PIG-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846087#action_12846087 ]

Daniel Dai commented on PIG-1301:
---------------------------------

Thanks for reporting. I tried the script on trunk, and it appears the problem is already
fixed there. The code you pointed to does have the problem you describe, but on trunk we
have already changed it to:

{code}
boolean anyPruned = false;
for (LOProject loProject : projectFinder.getProjectSet()) {
    Pair<Integer, Integer> pair = new Pair<Integer, Integer>(0, loProject.getCol());
    if (columns.contains(pair)) {
        anyPruned = true;
        break;
    }
}
{code}

The fix will ship with the next Pig release.
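
To make the difference between the two checks concrete, here is a minimal standalone sketch. It is not the actual LOForEach code: the class name PruneCheckSketch is hypothetical, and columns are modeled as plain integer indexes rather than Pair<Integer, Integer> objects. It assumes, as in the reported script, that field1 is column 0, field2 is column 1, and only field2 is pruned. What trunk does after computing anyPruned is not shown in the snippet above, so the sketch only compares the two boolean results.

```java
import java.util.List;
import java.util.Set;

// Hypothetical sketch; not part of Pig. Columns are plain indexes (assumption).
public class PruneCheckSketch {

    // Check quoted in the report (0.6.0): true only if every column
    // referenced by the inner plan is in the pruned set.
    public static boolean allPruned(List<Integer> planColumns, Set<Integer> prunedColumns) {
        boolean allPruned = true;
        for (int col : planColumns) {
            if (!prunedColumns.contains(col)) {
                allPruned = false;
                break;
            }
        }
        return allPruned;
    }

    // Check quoted from trunk: true as soon as one referenced column
    // is in the pruned set.
    public static boolean anyPruned(List<Integer> planColumns, Set<Integer> prunedColumns) {
        boolean anyPruned = false;
        for (int col : planColumns) {
            if (prunedColumns.contains(col)) {
                anyPruned = true;
                break;
            }
        }
        return anyPruned;
    }

    public static void main(String[] args) {
        Set<Integer> pruned = Set.of(1);        // field2 is pruned
        List<Integer> udfPlan = List.of(0, 1);  // contains(field1, field2)

        // The old check misses the UDF plan because field1 is still live.
        System.out.println("allPruned=" + allPruned(udfPlan, pruned));
        // The trunk check flags it because field2 is gone.
        System.out.println("anyPruned=" + anyPruned(udfPlan, pruned));
    }
}
```

For a plan that reads only field2, both checks return true, and for a plan that reads only field1, both return false. That is consistent with the report that the script runs when the UDF takes a single field.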

> Problem pruning columns with UDF
> --------------------------------
>
>                 Key: PIG-1301
>                 URL: https://issues.apache.org/jira/browse/PIG-1301
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.6.0
>            Reporter: Andrew Groh
>             Fix For: 0.7.0
>
>
> I just upgraded to pig 0.6.0.
> I have a pig file like
> raw = load 'foo.csv' using PigStorage() as (field1:chararray, field2:chararray);
> define contains com.mycompany.pig.Contains();
> rawactions = foreach raw generate contains(field1, field2) as junk,  field1;
> reqcnt = foreach rawactions generate field1;
> dump reqcnt
> When I try to run this code, I get an error:
> Problem with input: (Name: Project 1-40 Projections: [1] Overloaded: false Operator Key: 1-40) of User-defined function: (Name: UserFunc 1-39 function: com.mycompany.pig.Contains Operator Key: 1-39)
> Thrown from line 98 of LOUserFunction.java
> This was caused by another FrontEndException 
> Attempt to access field: 1 from schema: {field1: chararray}
> from Schema.java
> I also investigated changing the pig code
> if you change
> rawactions = foreach raw generate contains(field1, field2) as junk,  field1;
> to either
> rawactions = foreach raw generate contains(field2, field2) as junk,  field1;
> or
> rawactions = foreach raw generate contains(field1, field1) as junk,  field1;
> or if you change
> reqcnt = foreach rawactions generate field1;
> to
> reqcnt = foreach rawactions generate field1, junk;
> It all works correctly.
> The problem appears to be that it prunes out field2, but then gets confused and does not prune out the plan associated with the UDF contains, since field1 is not pruned.  So if the UDF only references field2, the plan gets removed; if it only references field1, that field has not been pruned and the UDF can run.
> I eventually tracked this down to the code around line 947 of LOForEach.java:
>             for (LOProject loProject : projectFinder.getProjectSet()) {
>                 Pair<Integer, Integer> pair = new Pair<Integer, Integer>(0,
>                         loProject.getCol());
>                 if (!columns.contains(pair)) {
>                     allPruned = false;
>                     break;
>                 }
>             }
>             if (allPruned) {
>                 planToRemove.add(i);
>             }
> In the example pig script, allPruned is false for the plan associated with the UDF.  This is because field1 is both a column for the UDF and for the ForEach in general.  Since field1 is not pruned, the plan is not removed and bad things happen later.
> I don't really understand the pruning code all that well, so I don't have a fix for it.  I hope that it will be clear to someone who understands this code better.  I can provide a better test case for this if necessary.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

