hadoop-pig-dev mailing list archives

From "Pradeep Kamath (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-514) COUNT returns no results as a result of two filter statements in FOREACH
Date Sat, 01 Nov 2008 03:51:44 GMT

    [ https://issues.apache.org/jira/browse/PIG-514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12644485#action_12644485 ]

Pradeep Kamath commented on PIG-514:
------------------------------------

The issue is that, for each group in the input data, one of the filters always filters out
all data, so the POFilter returns a POStatus.STATUS_EOP. The POUserFunc sees this EOP, does
not call the actual UDF (COUNT() or SUM()), and just sends the EOP to POForeach. The POForeach
sees this EOP and finishes processing that group without outputting any results.
Ideally, for COUNT() and SUM(), POUserFunc should send an empty bag as input so that COUNT()
can be 0 and SUM() can be null. However, this issue is also present in the following code:
{code}
a = load 'bla';
b = filter a by 2 == 1; -- just an illustration of an aggressive filter which filters every tuple
c = foreach b generate myudf($0);
{code}

In the above case, too, myudf() is never called. Is it OK to not call the UDF when there is
no input to give it (the EOP case)? Because of this, queries like the one in the description
do not give the correct COUNT of 0 and SUM of null when the input to them is empty. We
need to decide how to handle this general case, both for aggregate functions like COUNT()
and for non-aggregate functions like myudf().
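
To make the EOP propagation concrete, here is a minimal toy model in Python. It is only a sketch: the function names echo the operators mentioned above but are hypothetical and do not reflect Pig's real classes or signatures, and the empty_bag_on_eop flag is an invented switch standing in for the "send an empty bag" idea.

```python
# Toy model only: names echo POFilter/POUserFunc/POForeach but this is not Pig's code.
STATUS_OK = "STATUS_OK"
STATUS_EOP = "STATUS_EOP"


def po_filter(bag, predicate):
    """Return (status, bag); report EOP when every tuple is filtered out."""
    kept = [t for t in bag if predicate(t)]
    return (STATUS_OK, kept) if kept else (STATUS_EOP, None)


def po_user_func(udf, status, bag, empty_bag_on_eop=False):
    """Call the UDF, or forward EOP without calling it (the behaviour described above)."""
    if status == STATUS_EOP:
        if not empty_bag_on_eop:
            return STATUS_EOP, None
        bag = []  # hypothetical alternative: hand the UDF an empty bag instead
    return STATUS_OK, udf(bag)


def count_udf(bag):
    """COUNT of a bag; 0 for an empty bag."""
    return len(bag)


def sum_udf(bag):
    """SUM of the scores field; None (null) for an empty bag."""
    return sum(t[1] for t in bag) if bag else None


def po_foreach(group, bag, empty_bag_on_eop=False):
    """Emit one (group, COUNT, SUM) row, or nothing if any input reports EOP."""
    f1 = po_filter(bag, lambda t: t[0] == 1)  # record_type == 1
    f2 = po_filter(bag, lambda t: t[0] == 0)  # record_type == 0
    results = [
        po_user_func(count_udf, *f1, empty_bag_on_eop),
        po_user_func(sum_udf, *f2, empty_bag_on_eop),
    ]
    if any(status == STATUS_EOP for status, _ in results):
        return []  # the whole group is dropped without output
    return [(group,) + tuple(value for _, value in results)]
```

In this model, a group whose tuples all have record_type 0 produces no row by default, because the first filter empties its bag. With empty_bag_on_eop=True the same group yields a row with a COUNT of 0.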

One other case of the COUNT problem is:
{code}
a = load 'emptyfile'; -- load an empty file
-- neither of the statements below ever actually gets executed
b = group a all;
c = foreach b generate COUNT(a);
{code}
When the input data is empty, neither map() nor reduce() gets executed and hence COUNT() never
gets called.
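
The empty-input case can be sketched the same way. The Python below is a toy stand-in, not Hadoop's or Pig's API: group_all and count_per_group are hypothetical helpers that play the roles of the map and reduce phases for "group a all" followed by COUNT.

```python
def group_all(records):
    """Toy 'group a all': the loop body stands in for map(), run once per record."""
    groups = {}
    for record in records:
        groups.setdefault("all", []).append(record)
    return groups


def count_per_group(groups):
    """Stands in for reduce(): the count is computed once per existing group."""
    return [(key, len(bag)) for key, bag in groups.items()]
```

With an empty input no group is ever created, so the counting step has nothing to run on and the result is an empty list rather than a single row holding 0.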


> COUNT returns no results as a result of two filter statements in FOREACH
> ------------------------------------------------------------------------
>
>                 Key: PIG-514
>                 URL: https://issues.apache.org/jira/browse/PIG-514
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: types_branch
>            Reporter: Viraj Bhat
>             Fix For: types_branch
>
>         Attachments: mystudentfile.txt
>
>
> The following sample code uses a FOREACH to count the student records filtered on
> record_type == 1 and age == scores, and also those filtered on record_type == 0. It does
> not seem to return any results.
> {code}
> mydata = LOAD 'mystudentfile.txt' AS  (record_type,name,age,scores,gpa);
> --keep only what we need
> mydata_filtered = FOREACH  mydata GENERATE   record_type,  name,  age,  scores ;
> --group
> mydata_grouped = GROUP mydata_filtered BY  (record_type,age);
> myfinaldata = FOREACH mydata_grouped {
>      myfilter1 = FILTER mydata_filtered BY record_type == 1 AND age == scores;
>      myfilter2 = FILTER mydata_filtered BY record_type == 0;
>      GENERATE FLATTEN(group),
> -- Only this count causes the problem ??
>       COUNT(myfilter1) as col2,
>       SUM(myfilter2.scores) as col3,
>       COUNT(myfilter2) as col4;  };
> --this set of statements confirms that the count on the filter returns 1
> --mycountdata = FOREACH mydata_grouped
> --{
> --      myfilter1 = FILTER mydata_filtered BY record_type == 1 AND age == scores;
> --      GENERATE
> --      COUNT(myfilter1) as colcount;
> --};
> --dump mycountdata;
> dump myfinaldata;
> {code}
> But if you comment out the {code} COUNT(myfilter1) as col2, {code} line, it seems to work,
> with the following results:
> (0,22,45.0,2L)
> (0,24,133.0,6L)
> (0,25,22.0,1L)
> I have also tried to verify whether this is an issue with {code} COUNT(myfilter1) as col2,
> {code} returning zero. It does not seem to be the case.
> If {code}  dump mycountdata; {code} is uncommented it returns:
> (1L)
> (1L)
> I am attaching the tab-separated 'mystudentfile.txt' file used in this Pig script. Is
> this an issue with two filters in the FOREACH followed by a COUNT on these filters?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

