hadoop-pig-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Andrew Groh (JIRA)" <j...@apache.org>
Subject [jira] Created: (PIG-1301) Problem pruning columns with UDF
Date Tue, 16 Mar 2010 15:06:27 GMT
Problem pruning columns with UDF
--------------------------------

                 Key: PIG-1301
                 URL: https://issues.apache.org/jira/browse/PIG-1301
             Project: Pig
          Issue Type: Bug
          Components: impl
    Affects Versions: 0.6.0
            Reporter: Andrew Groh


I just upgraded to pig 0.6.0.

I have a pig file like
raw = load 'foo.csv' using PigStorage() as (field1:chararray, field2:chararray);

define contains com.mycompany.pig.Contains();

rawactions = foreach raw generate contains(field1, field2) as junk,  field1;

reqcnt = foreach rawactions generate field1;

dump reqcnt

When I try to run this code, I get an error:
Problem with input: (Name: Project 1-40 Projections: [1] Overloaded: false Operator Key: 1-40)
of User-defined function: (Name: UserFunc 1-39 function: com.mycompany.pig.Contains Operator
Key: 1-39)
Thrown from line 98 of LOUserFunction.java

This was caused by another FrontEndException 
Attempt to access field: 1 from schema: {field1: chararray}
from Schema.java

I also investigated changing the pig code
if you change
rawactions = foreach raw generate contains(field1, field2) as junk,  field1;

to either
rawactions = foreach raw generate contains(field2, field2) as junk,  field1;
or
rawactions = foreach raw generate contains(field2, field2) as junk,  field1;

or if you change
reqcnt = foreach rawactions generate field1;
to
reqcnt = foreach rawactions generate field1, junk;

It all works correctly.

The problem appears to be that it prunes out field2, but then gets confused and does not prune
out the plan associated with the UDF contains, since field1 is not pruned.  So if the UDF
only references field2 it will get removed, if it only references field1 the field will have
not been pruned and it can run.

I eventually tracked this down to the code around 947 of LOForEach.java
            for (LOProject loProject : projectFinder.getProjectSet()) {
                Pair<Integer, Integer> pair = new Pair<Integer, Integer>(0,
                        loProject.getCol());
                if (!columns.contains(pair)) {
                    allPruned = false;
                    break;
                }
            }
            if (allPruned) {
                planToRemove.add(i);
            }

In the example pig, allPruned is false for the plan associated the UDF.  This is because field1
is both a column for the UDF and for the ForEach in general.  Since field1 is not pruned,
the plan is not removed and bad things happen later.

I don't really understand the pruning code all that well, so I don't have a fix for it.  I
hope that it will be clear to someone who understands this code better.  I can provide a better
test case for this if necessary.



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message