hadoop-pig-dev mailing list archives

From "Pradeep Kamath (JIRA)" <j...@apache.org>
Subject [jira] Updated: (PIG-514) COUNT returns no results as a result of two filter statements in FOREACH
Date Tue, 21 Apr 2009 21:40:47 GMT

     [ https://issues.apache.org/jira/browse/PIG-514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pradeep Kamath updated PIG-514:
-------------------------------

    Attachment: PIG-514.patch

The attached patch implements the proposed design. The changes are spread across the following
areas:
- Parser: in QueryParser.jjt, when a relational operator is followed by a Project(*), the
Project is now marked as a special Project that sends empty bags to the predecessor
on EOP.
- In the LogToPhyTranslationVisitor, either a PORelationToExprProject or a regular POProject
is created, depending on whether the above flag is set.
- PORelationToExprProject extends POProject and overrides only the getNext(DataBag) method.
On first encountering an EOP it sends an empty bag and sets its state so that an EOP is sent
down the next time it is called. However, if the POForEach containing this project starts
a new set of inputs, this flag is reset in the reset() method.
- PhysicalOperator now has a reset() method, used by PORelationToExprProject and by the
limit/sort/distinct operators when a limit is present, to reset state when new input for the
foreach starts.
- The builtins (SUM/COUNT/MIN/MAX/AVG) now handle empty bags: COUNT gives 0 and the others
give null as output. This change includes the type-specific implementations of these
aggregates, such as IntSum, LongSum, etc.
- Test cases for the empty bag case.
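
The behaviour described in the list above can be sketched in isolation. The following is a
minimal, self-contained illustration and not the code from PIG-514.patch: Status, Result,
RelationToExprProjectSketch and EmptyBagAggregates are hypothetical stand-ins, a bag is
modelled as a plain List<Long>, and clearing the flag after ordinary data has been passed on
is an assumption of this sketch rather than something stated in the description.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for Pig's operator status and result types.
enum Status { OK, EOP }

final class Result {
    final Status status;
    final List<Long> bag; // null when status is EOP

    Result(Status status, List<Long> bag) {
        this.status = status;
        this.bag = bag;
    }
}

// Sketch of the "empty bag on first EOP" behaviour described for PORelationToExprProject.
final class RelationToExprProjectSketch {
    private boolean sendEmptyBagOnEop = true;

    // Called when the enclosing foreach starts a new set of inputs.
    void reset() {
        sendEmptyBagOnEop = true;
    }

    // 'upstream' is whatever the input operator produced for this call.
    Result getNext(Result upstream) {
        if (upstream.status == Status.EOP && sendEmptyBagOnEop) {
            // First EOP: hand on an empty bag and remember to pass the EOP next time.
            sendEmptyBagOnEop = false;
            return new Result(Status.OK, new ArrayList<>());
        }
        // Assumption: once anything has been passed on, no empty bag is
        // synthesised again until reset() is called.
        sendEmptyBagOnEop = false;
        return upstream;
    }
}

// Sketch of aggregates that tolerate empty bags: COUNT gives 0, SUM gives null.
final class EmptyBagAggregates {
    static long count(List<Long> bag) {
        return bag.size();
    }

    static Long sum(List<Long> bag) {
        if (bag.isEmpty()) {
            return null;
        }
        long total = 0L;
        for (long v : bag) {
            total += v;
        }
        return total;
    }
}
```

In this sketch, a filter that matches nothing for a group would surface as an EOP on the first
call, which the project turns into an empty bag, so count(...) yields 0 for that group instead
of the group producing no output at all.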


> COUNT returns no results as a result of two filter statements in FOREACH
> ------------------------------------------------------------------------
>
>                 Key: PIG-514
>                 URL: https://issues.apache.org/jira/browse/PIG-514
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.2.0
>            Reporter: Viraj Bhat
>            Assignee: Pradeep Kamath
>         Attachments: mystudentfile.txt, PIG-514.patch
>
>
> The following sample code counts filtered student records inside a FOREACH, once for
> record_type == 1 AND age == scores and once for record_type == 0, but it does not seem
> to return any results.
> {code}
> mydata = LOAD 'mystudentfile.txt' AS  (record_type,name,age,scores,gpa);
> --keep only what we need
> mydata_filtered = FOREACH  mydata GENERATE   record_type,  name,  age,  scores ;
> --group
> mydata_grouped = GROUP mydata_filtered BY  (record_type,age);
> myfinaldata = FOREACH mydata_grouped {
>      myfilter1 = FILTER mydata_filtered BY record_type == 1 AND age == scores;
>      myfilter2 = FILTER mydata_filtered BY record_type == 0;
>      GENERATE FLATTEN(group),
> -- Only this count causes the problem ??
>       COUNT(myfilter1) as col2,
>       SUM(myfilter2.scores) as col3,
>       COUNT(myfilter2) as col4;  };
> --this set of statements confirms that the count on the filter returns 1
> --mycountdata = FOREACH mydata_grouped
> --{
> --      myfilter1 = FILTER mydata_filtered BY record_type == 1 AND age == scores;
> --      GENERATE
> --      COUNT(myfilter1) as colcount;
> --};
> --dump mycountdata;
> dump myfinaldata;
> {code}
> But if you comment out the {code} COUNT(myfilter1) as col2, {code} line, it seems to work,
> with the following results:
> (0,22,45.0,2L)
> (0,24,133.0,6L)
> (0,25,22.0,1L)
> I have also tried to verify whether the issue is {code} COUNT(myfilter1) as col2, {code}
> returning zero. That does not seem to be the case.
> If {code}  dump mycountdata; {code} is uncommented it returns:
> (1L)
> (1L)
> I am attaching the tab-separated 'mystudentfile.txt' file used in this Pig script. Is
> this an issue with two filters in the FOREACH followed by a COUNT on these filters?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

