pig-dev mailing list archives

From "Jacob Perkins (JIRA)" <j...@apache.org>
Subject [jira] [Created] (PIG-2014) SAMPLE shouldn't be pushed up
Date Tue, 26 Apr 2011 14:37:04 GMT
SAMPLE shouldn't be pushed up
-----------------------------

                 Key: PIG-2014
                 URL: https://issues.apache.org/jira/browse/PIG-2014
             Project: Pig
          Issue Type: Bug
            Reporter: Jacob Perkins


Consider the following code:

{code:none}
tfidf_all = LOAD '$TFIDF' AS (doc_id:chararray, token:chararray, weight:double);
grouped   = GROUP tfidf_all BY doc_id;
vectors   = FOREACH grouped GENERATE group AS doc_id, tfidf_all.(token, weight) AS vector;
DUMP vectors;
{code}

This, of course, runs just fine. In a real example, tfidf_all contains 1,428,280 records.
The number of reduce output records should be exactly the number of documents, which turns
out to be 18,863 in this case. All well and good.

The strangeness comes when you add a SAMPLE command:

{code:none}
sampled = SAMPLE vectors 0.0012;
DUMP sampled;
{code}

Running this results in 1,513 reduce output records. The number of reduce output records
should be much closer to 22 or 23 (e.g. 0.0012 * 18,863 is about 22.6).

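Evidently, Pig rewrites SAMPLE into a filter and then pushes that filter in front of the
group, so individual records are sampled instead of whole groups. Sampling 0.0012 of the
1,428,280 input records keeps roughly 1,714 of them, which is consistent with 1,513 distinct
doc_ids reaching the reducer. Pig shouldn't push that filter, since the UDF it uses is
non-deterministic.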
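
To make the difference between the two plans concrete, here is a rough Python simulation (not Pig code, and not how Pig implements SAMPLE internally). It uses the record count, document count and sample probability from this report. The data itself is hypothetical: each record is assigned to a uniformly random document, which is almost certainly not how the real tfidf_all is distributed. The function names are made up for illustration.

```python
import random

N_DOCS = 18_863        # documents reported in the issue
N_RECORDS = 1_428_280  # (doc_id, token, weight) records reported in the issue
P = 0.0012             # SAMPLE probability from the script


def make_doc_ids(n_records, n_docs, rng):
    # Hypothetical data: each record belongs to a uniformly random document.
    return [rng.randrange(n_docs) for _ in range(n_records)]


def sample_after_group(doc_ids, p, rng):
    # Intended plan: GROUP first, then keep each group with probability p.
    groups = sorted(set(doc_ids))
    return [d for d in groups if rng.random() <= p]


def sample_before_group(doc_ids, p, rng):
    # Pushed-up plan: keep each record with probability p, then GROUP.
    kept = [d for d in doc_ids if rng.random() <= p]
    return sorted(set(kept))


def simulate(seed=0):
    # Returns (groups kept when sampling after GROUP,
    #          groups produced when sampling before GROUP).
    rng = random.Random(seed)
    doc_ids = make_doc_ids(N_RECORDS, N_DOCS, rng)
    return (len(sample_after_group(doc_ids, P, rng)),
            len(sample_before_group(doc_ids, P, rng)))


if __name__ == "__main__":
    after, before = simulate()
    print("groups kept, SAMPLE after GROUP: ", after)
    print("groups kept, SAMPLE before GROUP:", before)
```

Sampling after the group should give a count in the low tens, while sampling before the group should give a count in the low-to-mid thousands. With a skewed real-world distribution the second number would presumably be somewhat lower than under the uniform assumption, which would fit the 1,513 seen here.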

Quick fix: If you add "-t PushUpFilter" to your command line when invoking pig, that
optimization is turned off and this won't happen.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
