hadoop-pig-dev mailing list archives

From "Viraj Bhat (JIRA)" <j...@apache.org>
Subject [jira] Created: (PIG-739) Filter in foreach seems to drop records resulting in decreased count of records
Date Mon, 30 Mar 2009 20:37:51 GMT
Filter in foreach seems to drop records resulting in decreased count of records
-------------------------------------------------------------------------------

                 Key: PIG-739
                 URL: https://issues.apache.org/jira/browse/PIG-739
             Project: Pig
          Issue Type: Bug
          Components: impl
    Affects Versions: 0.3.0
            Reporter: Viraj Bhat
             Fix For: 0.3.0


I have a Pig script that counts the number of distinct records resulting from a filter;
the filter and distinct statements are nested inside a foreach. The number of records I get
with alias TESTDATA_AGG_2 is 1.

{code}
TESTDATA = load 'testdata' using PigStorage() as (timestamp:chararray, testid:chararray,
    userid:chararray, sessionid:chararray, value:long, flag:int);

TESTDATA_FILTERED = filter TESTDATA by (timestamp gte '1230800400000' and timestamp lt '1230804000000'
    and value != 0);

TESTDATA_GROUP = group TESTDATA_FILTERED by testid;

TESTDATA_AGG = foreach TESTDATA_GROUP {
    A = filter TESTDATA_FILTERED by (userid eq sessionid);
    C = distinct A.userid;
    generate group as testid, COUNT(TESTDATA_FILTERED) as counttestdata,
        COUNT(C) as distcount, SUM(TESTDATA_FILTERED.flag) as total_flags;
}

TESTDATA_AGG_1 = group TESTDATA_AGG ALL;

-- count records generated through nested foreach which contains distinct
TESTDATA_AGG_2 = foreach TESTDATA_AGG_1 generate COUNT(TESTDATA_AGG);

--explain TESTDATA_AGG_2;
dump TESTDATA_AGG_2;
--RESULT (1L)
{code}

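However, when I count the records without the nested filter and distinct in the foreach, I get
a different value (20L). Both scripts group the same filtered data by testid and generate one
record per group, so the two counts should match.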

{code}
TESTDATA = load 'testdata' using PigStorage() as (timestamp:chararray, testid:chararray,
    userid:chararray, sessionid:chararray, value:long, flag:int);

TESTDATA_FILTERED = filter TESTDATA by (timestamp gte '1230800400000' and timestamp lt '1230804000000'
    and value != 0);

TESTDATA_GROUP = group TESTDATA_FILTERED by testid;

-- count records generated through simple foreach
TESTDATA_AGG2 = foreach TESTDATA_GROUP generate group as testid, COUNT(TESTDATA_FILTERED)
    as counttestid, SUM(TESTDATA_FILTERED.flag) as total_flags;

TESTDATA_AGG2_1 = group TESTDATA_AGG2 ALL;
TESTDATA_AGG2_2 = foreach TESTDATA_AGG2_1 generate COUNT(TESTDATA_AGG2);
dump TESTDATA_AGG2_2;
--RESULT (20L)
{code}
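
To make the expected behaviour concrete, here is a minimal Python sketch of the semantics both scripts should have. It is only a reference model, not Pig's implementation: the rows in ROWS are hypothetical (they are not taken from the attached testdata), the function names are invented for illustration, and it assumes that COUNT over an empty bag yields 0 rather than dropping the record. Under those assumptions, the nested and simple aggregations emit one record per testid group, so the final counts agree.

```python
from collections import defaultdict

# Hypothetical rows: (timestamp, testid, userid, sessionid, value, flag).
# These are illustrative only and are NOT the attached testdata.
ROWS = [
    ("1230800400001", "t1", "u1", "u1", 5, 1),
    ("1230800400002", "t1", "u2", "s9", 3, 0),
    ("1230800400003", "t2", "u3", "s7", 2, 1),
    ("1230800400004", "t3", "u4", "u4", 0, 1),  # dropped: value == 0
    ("1230804000000", "t4", "u5", "u5", 1, 1),  # dropped: timestamp not below upper bound
]


def filtered(rows):
    """Model of TESTDATA_FILTERED (string comparison, as with chararray)."""
    return [r for r in rows
            if "1230800400000" <= r[0] < "1230804000000" and r[4] != 0]


def group_by_testid(rows):
    """Model of TESTDATA_GROUP: testid -> bag of rows."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[1]].append(r)
    return groups


def nested_agg(rows):
    """Model of TESTDATA_AGG: nested filter + distinct inside the foreach."""
    out = []
    for testid, bag in group_by_testid(filtered(rows)).items():
        a = [r for r in bag if r[2] == r[3]]   # A = filter ... userid eq sessionid
        c = {r[2] for r in a}                  # C = distinct A.userid
        # Assumption: an empty C still yields a record with distcount 0.
        out.append((testid, len(bag), len(c), sum(r[5] for r in bag)))
    return out


def simple_agg(rows):
    """Model of TESTDATA_AGG2: same grouping, no nested filter or distinct."""
    return [(testid, len(bag), sum(r[5] for r in bag))
            for testid, bag in group_by_testid(filtered(rows)).items()]


if __name__ == "__main__":
    # Both lengths correspond to the final COUNT over the "group ... ALL" step.
    print(len(nested_agg(ROWS)), len(simple_agg(ROWS)))
```

In this model, group t2 has no row where userid equals sessionid, yet it still contributes a record to the nested aggregation; only its distcount is 0.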

I am attaching the testdata file.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

