hadoop-pig-dev mailing list archives

From "Viraj Bhat (JIRA)" <j...@apache.org>
Subject [jira] Created: (PIG-594) Inconsistent behaviour of FilterFunc UDF when used in the Filter and ForEach statements
Date Tue, 30 Dec 2008 23:36:44 GMT
Inconsistent behaviour of FilterFunc UDF when used in the Filter and ForEach statements
---------------------------------------------------------------------------------------

                 Key: PIG-594
                 URL: https://issues.apache.org/jira/browse/PIG-594
             Project: Pig
          Issue Type: Bug
          Components: impl
    Affects Versions: types_branch
            Reporter: Viraj Bhat
             Fix For: types_branch


I have a UDF named INSETFROMFILE, which matches data against a set of values stored in
an HDFS file. INSETFROMFILE extends FilterFunc. Here is a sample Pig script that uses
it.
{code}
register util.jar;
define InQuerySet util.INSETFROMFILE('/user/viraj/insetfilterfile');
A = load '/user/viraj/myurldata.txt' using PigStorage() as (url, bcookie);
B = group A by (url);
C = foreach B generate ((InQuerySet(A.bcookie))?1:0) as inset, A;
dump C;
{code}
This script fails with the following exception in the reducer:
================================================================================================================
        at org.apache.pig.backend.hadoop.datastorage.ConfigurationUtil.toProperties(ConfigurationUtil.java:45)
        at util.INSETFROMFILE.init(INSETFROMFILE.java:79)
        at util.INSETFROMFILE.exec(INSETFROMFILE.java:99)
        at util.INSETFROMFILE.exec(INSETFROMFILE.java:61)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:185)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:223)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POBinCond.getNext(POBinCond.java:92)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:259)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:197)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.runPipeline(PigMapReduce.java:280)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.processOnePackageOutput(PigMapReduce.java:247)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:224)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:136)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:318)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2209)
================================================================================================================
To avoid this error, we use the INSETFROMFILE UDF in a FILTER statement instead, and it works.
{code}
register util.jar;
define InQuerySet util.INSETFROMFILE('/user/viraj/insetfilterfile');
A = load '/user/viraj/myurldata.txt' using PigStorage() as (url, bcookie);
B = filter A by InQuerySet(bcookie);
dump B;
{code}
The result is:
(www.yahoo.com,12344)

Problems:
1) Why does the FilterFunc UDF, INSETFROMFILE, behave inconsistently when used in
FOREACH?
2) Is there a rule that a FilterFunc UDF may be used only in a FILTER statement?
3) Properties props = ConfigurationUtil.toProperties(PigInputFormat.sJob) is null when the
FilterFunc UDF is called within FOREACH.
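
To make point 3 concrete, here is a minimal stand-alone sketch of the failure pattern. It does not use any Pig classes: the static field `job` is a hypothetical stand-in for PigInputFormat.sJob, and `initUnguarded` is a hypothetical stand-in for the UDF's init() calling ConfigurationUtil.toProperties. It assumes, as the stack trace suggests but does not prove, that the static job configuration is simply not set in the reducer and that the conversion dereferences it.

```java
import java.util.Properties;

/**
 * Stand-alone illustration only; none of these names are Pig APIs.
 * "job" plays the role of a static job configuration that is populated
 * in some task contexts and left null in others (an assumption).
 */
public class LazyInitSketch {
    // Hypothetical stand-in for PigInputFormat.sJob.
    public static Properties job = null;

    // Mimics an init() that converts the job configuration without checking it.
    public static Properties initUnguarded() {
        Properties props = new Properties();
        // Dereferences job directly, so a null job raises NullPointerException.
        for (String name : job.stringPropertyNames()) {
            props.setProperty(name, job.getProperty(name));
        }
        return props;
    }

    // Same conversion, but fails with an explicit message when job is unset.
    public static Properties initGuarded() {
        if (job == null) {
            throw new IllegalStateException(
                "job configuration is not set in this task context");
        }
        Properties props = new Properties();
        for (String name : job.stringPropertyNames()) {
            props.setProperty(name, job.getProperty(name));
        }
        return props;
    }

    public static void main(String[] args) {
        job = null; // reducer-like context, per the assumption above
        try {
            initUnguarded();
        } catch (NullPointerException e) {
            System.out.println("unguarded init failed with NullPointerException");
        }
        try {
            initGuarded();
        } catch (IllegalStateException e) {
            System.out.println("guarded init failed: " + e.getMessage());
        }

        job = new Properties(); // context where the configuration is available
        job.setProperty("example.key", "example-value");
        System.out.println("guarded init copied " + initGuarded().size() + " property");
    }
}
```

The guard does not make the UDF work in FOREACH; it only turns an opaque failure deep in the stack into a message that says which precondition was not met.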

Attaching data and script file for testing.


