hadoop-pig-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Viraj Bhat (JIRA)" <j...@apache.org>
Subject [jira] Updated: (PIG-594) Inconsistent behaviour of FilterFunc UDF when used in the Filter and ForEach statements
Date Tue, 30 Dec 2008 23:38:44 GMT

     [ https://issues.apache.org/jira/browse/PIG-594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Viraj Bhat updated PIG-594:
---------------------------

    Description: 
I have a UDF known as INSETFROMFILE, which matches data against a set of values stored in
an HDFS file. The  INSETFROMFILE extends FilterFunc. Here is a sample pig script which uses
it.
{code}
register util.jar;
define InQuerySet util.INSETFROMFILE('/user/viraj/insetfilterfile');
A = load '/user/viraj/myurldata.txt' using PigStorage() as (url, bcookie);
B = group A by (url);
C = foreach B generate ((InQuerySet(A.bcookie))?1:0) as inset, A;
dump C;
{code}
This script fails with the following exception in the reducer:
================================================================================================================
java.lang.NullPointerException 
 at org.apache.pig.backend.hadoop.datastorage.ConfigurationUtil.toProperties(ConfigurationUtil.java:45)
        at util.INSETFROMFILE.init(INSETFROMFILE.java:79)
        at util.INSETFROMFILE.exec(INSETFROMFILE.java:99)
        at util.INSETFROMFILE.exec(INSETFROMFILE.java:61)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:185)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:223)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POBinCond.getNext(POBinCond.java:92)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:259)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:197)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.runPipeline(PigMapReduce.java:280)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.processOnePackageOutput(PigMapReduce.java:247)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:224)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:136)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:318)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2209)
================================================================================================================
To avoid this error we use the INSETFROMFILE UDF in the Filter statement of Pig and it works.
{code}
register util.jar;
define InQuerySet util.INSETFROMFILE('/user/viraj/insetfilterfile');
A = load '/user/viraj/myurldata.txt' using PigStorage() as (url, bcookie);
B = filter A by InQuerySet(bcookie);
dump B;
{code}
The result is:
(www.yahoo.com,12344)

Problems:
1) Why does the FilterFunc UDF, INSETFROMFILE show inconsistent behaviour when used in the
FOREACH?
2) Is there a rule that FilterFunc UDF should be used in Filter statement?
3) Properties props = ConfigurationUtil.toProperties(PigInputFormat.sJob) is null when the
FilterFunc UDF is called within ForEach

Attaching data and script file for testing.

  was:
I have a UDF known as INSETFROMFILE, which matches data against a set of values stored in
an HDFS file. The  INSETFROMFILE extends FilterFunc. Here is a sample pig script which uses
it.
{code}
register util.jar;
define InQuerySet util.INSETFROMFILE('/user/viraj/insetfilterfile');
A = load '/user/viraj/myurldata.txt' using PigStorage() as (url, bcookie);
B = group A by (url);
C = foreach B generate ((InQuerySet(A.bcookie))?1:0) as inset, A;
dump C;
{code}
This script fails with the following exception in the reducer:
================================================================================================================
  at org.apache.pig.backend.hadoop.datastorage.ConfigurationUtil.toProperties(ConfigurationUtil.java:45)
        at util.INSETFROMFILE.init(INSETFROMFILE.java:79)
        at util.INSETFROMFILE.exec(INSETFROMFILE.java:99)
        at util.INSETFROMFILE.exec(INSETFROMFILE.java:61)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:185)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:223)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POBinCond.getNext(POBinCond.java:92)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:259)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:197)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.runPipeline(PigMapReduce.java:280)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.processOnePackageOutput(PigMapReduce.java:247)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:224)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:136)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:318)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2209)
================================================================================================================
To avoid this error we use the INSETFROMFILE UDF in the Filter statement of Pig and it works.
{code}
register util.jar;
define InQuerySet util.INSETFROMFILE('/user/viraj/insetfilterfile');
A = load '/user/viraj/myurldata.txt' using PigStorage() as (url, bcookie);
B = filter A by InQuerySet(bcookie);
dump B;
{code}
The result is:
(www.yahoo.com,12344)

Problems:
1) Why does the FilterFunc UDF, INSETFROMFILE show inconsistent behaviour when used in the
FOREACH?
2) Is there a rule that FilterFunc UDF should be used in Filter statement?
3) Properties props = ConfigurationUtil.toProperties(PigInputFormat.sJob) is null when the
FilterFunc UDF is called within ForEach

Attaching data and script file for testing.


> Inconsistent behaviour of FilterFunc UDF when used in the Filter and ForEach statements
> ---------------------------------------------------------------------------------------
>
>                 Key: PIG-594
>                 URL: https://issues.apache.org/jira/browse/PIG-594
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: types_branch
>            Reporter: Viraj Bhat
>             Fix For: types_branch
>
>
> I have a UDF known as INSETFROMFILE, which matches data against a set of values stored
in an HDFS file. The  INSETFROMFILE extends FilterFunc. Here is a sample pig script which
uses it.
> {code}
> register util.jar;
> define InQuerySet util.INSETFROMFILE('/user/viraj/insetfilterfile');
> A = load '/user/viraj/myurldata.txt' using PigStorage() as (url, bcookie);
> B = group A by (url);
> C = foreach B generate ((InQuerySet(A.bcookie))?1:0) as inset, A;
> dump C;
> {code}
> This script fails with the following exception in the reducer:
> ================================================================================================================
> java.lang.NullPointerException 
>  at org.apache.pig.backend.hadoop.datastorage.ConfigurationUtil.toProperties(ConfigurationUtil.java:45)
>         at util.INSETFROMFILE.init(INSETFROMFILE.java:79)
>         at util.INSETFROMFILE.exec(INSETFROMFILE.java:99)
>         at util.INSETFROMFILE.exec(INSETFROMFILE.java:61)
>         at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:185)
>         at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:223)
>         at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POBinCond.getNext(POBinCond.java:92)
>         at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:259)
>         at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:197)
>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.runPipeline(PigMapReduce.java:280)
>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.processOnePackageOutput(PigMapReduce.java:247)
>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:224)
>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:136)
>         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:318)
>         at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2209)
> ================================================================================================================
> To avoid this error we use the INSETFROMFILE UDF in the Filter statement of Pig and it
works.
> {code}
> register util.jar;
> define InQuerySet util.INSETFROMFILE('/user/viraj/insetfilterfile');
> A = load '/user/viraj/myurldata.txt' using PigStorage() as (url, bcookie);
> B = filter A by InQuerySet(bcookie);
> dump B;
> {code}
> The result is:
> (www.yahoo.com,12344)
> Problems:
> 1) Why does the FilterFunc UDF, INSETFROMFILE show inconsistent behaviour when used in
the FOREACH?
> 2) Is there a rule that FilterFunc UDF should be used in Filter statement?
> 3) Properties props = ConfigurationUtil.toProperties(PigInputFormat.sJob) is null when
the FilterFunc UDF is called within ForEach
> Attaching data and script file for testing.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message