hadoop-pig-dev mailing list archives

From "Viraj Bhat (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-514) COUNT returns no results as a result of two filter statements in FOREACH
Date Fri, 27 Mar 2009 01:08:50 GMT

    [ https://issues.apache.org/jira/browse/PIG-514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12689772#action_12689772 ]

Viraj Bhat commented on PIG-514:
--------------------------------

Another test case: consider the following tab-separated input file, test.txt:

1       1       3
1       2       3
2       1       3
2       1       3
 
The Pig script is:
{code} 
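-- Load three integer columns from the tab-separated input.
test   = load 'test.txt' as (col1: int, col2: int, col3: int);
test2 = group test by col1;
-- Per group, count the rows with col2 == 1 and the rows with col2 != 1.
test3 = foreach test2
{
        filter_one    = filter test by (col2==1);
        filter_notone = filter test by (col2!=1);
        generate group as col1, COUNT(filter_one) as cnt_one, COUNT(filter_notone) as cnt_notone;
};
dump test3;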
{code}
 
The output consists of a single line:
(1,1L,1L)
 
But I would expect two lines, one per group:
(1,1L,1L)
(2,2L,0L)
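
As a cross-check of the expected result, here is a minimal Python sketch of the same group, filter and count logic, applied to the four input rows above. It is not Pig and says nothing about where the bug lies. The helper name expected_counts is hypothetical, and the sketch assumes that COUNT over an empty filtered bag should yield 0, as in the expected output.

```python
from collections import defaultdict

# Rows from the test case above: (col1, col2, col3).
rows = [(1, 1, 3), (1, 2, 3), (2, 1, 3), (2, 1, 3)]


def expected_counts(rows):
    """Group rows by col1, then count rows with col2 == 1 and col2 != 1 per group.

    Hypothetical helper; assumes an empty filtered bag counts as 0.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)

    result = []
    for key in sorted(groups):
        bag = groups[key]
        cnt_one = sum(1 for r in bag if r[1] == 1)
        cnt_notone = sum(1 for r in bag if r[1] != 1)
        result.append((key, cnt_one, cnt_notone))
    return result


for line in expected_counts(rows):
    print(line)
```

This yields one tuple per group, (1, 1, 1) and (2, 2, 0), which matches the two lines expected from Pig apart from the L suffix on long values.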


> COUNT returns no results as a result of two filter statements in FOREACH
> ------------------------------------------------------------------------
>
>                 Key: PIG-514
>                 URL: https://issues.apache.org/jira/browse/PIG-514
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 1.0.0
>            Reporter: Viraj Bhat
>         Attachments: mystudentfile.txt
>
>
> The following sample code uses a FOREACH to count student records filtered on record_type == 1 AND age == scores, and also on record_type == 0. It does not seem to return any results.
> {code}
> mydata = LOAD 'mystudentfile.txt' AS  (record_type,name,age,scores,gpa);
> --keep only what we need
> mydata_filtered = FOREACH  mydata GENERATE   record_type,  name,  age,  scores ;
> --group
> mydata_grouped = GROUP mydata_filtered BY  (record_type,age);
> myfinaldata = FOREACH mydata_grouped {
>      myfilter1 = FILTER mydata_filtered BY record_type == 1 AND age == scores;
>      myfilter2 = FILTER mydata_filtered BY record_type == 0;
>      GENERATE FLATTEN(group),
> -- Only this count causes the problem ??
>       COUNT(myfilter1) as col2,
>       SUM(myfilter2.scores) as col3,
>       COUNT(myfilter2) as col4;  };
> --this set of statements confirms that the count on the filter returns 1
> --mycountdata = FOREACH mydata_grouped
> --{
> --      myfilter1 = FILTER mydata_filtered BY record_type == 1 AND age == scores;
> --      GENERATE
> --      COUNT(myfilter1) as colcount;
> --};
> --dump mycountdata;
> dump myfinaldata;
> {code}
> But if you comment out the {code} COUNT(myfilter1) as col2, {code} line, it seems to work, with the following results:
> (0,22,45.0,2L)
> (0,24,133.0,6L)
> (0,25,22.0,1L)
> I also tried to verify whether the problem is {code} COUNT(myfilter1) as col2, {code} returning zero. That does not seem to be the case.
> If the commented-out mycountdata block, including {code}  dump mycountdata; {code}, is uncommented, it returns:
> (1L)
> (1L)
> I am attaching the tab-separated 'mystudentfile.txt' file used in this Pig script. Is this an issue with two filters in the FOREACH followed by a COUNT on these filters?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

