pig-dev mailing list archives

From "Dmitriy V. Ryaboy (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (PIG-2014) SAMPLE shouldn't be pushed up
Date Wed, 11 May 2011 15:31:47 GMT

     [ https://issues.apache.org/jira/browse/PIG-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy V. Ryaboy updated PIG-2014:
-----------------------------------

    Attachment: PIG-2014.2.patch

This addresses PushUpFilter and FilterAboveForeach, and fixes the SAMPLE issue.

I didn't tackle PushDownForeachFlatten -- there's a lot going on there and I'm not sure I
understand it all. We should open a separate ticket for making sure that optimization does
not break on nondeterministic operations.
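
For that follow-up ticket, the kind of guard I have in mind can be sketched roughly
as below. This is only an illustration in Python, not Pig's optimizer code: Expr,
NONDETERMINISTIC_FUNCS, is_deterministic and may_push_up are hypothetical names, and
the sketch assumes the filter that SAMPLE becomes calls a random-number function,
called RANDOM here.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical registry of functions whose result can differ between calls.
# The name RANDOM is an assumption made for this sketch.
NONDETERMINISTIC_FUNCS = {"RANDOM"}


@dataclass
class Expr:
    """A node in a toy filter-predicate tree (not a Pig class)."""
    op: str                      # e.g. "<", ">", "CALL", "CONST", "FIELD"
    name: str = ""               # function, field, or constant text
    children: List["Expr"] = field(default_factory=list)


def is_deterministic(expr: Expr) -> bool:
    """Return True if no node in the tree calls a nondeterministic function."""
    if expr.op == "CALL" and expr.name.upper() in NONDETERMINISTIC_FUNCS:
        return False
    return all(is_deterministic(child) for child in expr.children)


def may_push_up(predicate: Expr) -> bool:
    """Allow a filter to be reordered only if its predicate is deterministic."""
    return is_deterministic(predicate)


# A SAMPLE-like predicate (random draw compared with 0.0012) must stay put.
sample_pred = Expr("<", children=[Expr("CALL", "RANDOM"), Expr("CONST", "0.0012")])
# An ordinary predicate on a field can still be moved.
plain_pred = Expr(">", children=[Expr("FIELD", "weight"), Expr("CONST", "0.5")])

print(may_push_up(sample_pred), may_push_up(plain_pred))
```

The same check would apply to any rule that moves or duplicates an expression, not
only to the two rules touched by this patch.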

> SAMPLE shouldn't be pushed up
> -----------------------------
>
>                 Key: PIG-2014
>                 URL: https://issues.apache.org/jira/browse/PIG-2014
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.0, 0.10
>            Reporter: Jacob Perkins
>            Assignee: Dmitriy V. Ryaboy
>             Fix For: 0.9.0
>
>         Attachments: PIG-2014.2.patch, PIG-2014.patch
>
>
> Consider the following code:
> {code:none}
> tfidf_all = LOAD '$TFIDF' AS (doc_id:chararray, token:chararray, weight:double);
> grouped   = GROUP tfidf_all BY doc_id;
> vectors   = FOREACH grouped GENERATE group AS doc_id, tfidf_all.(token, weight) AS vector;
> DUMP vectors;
> {code}
> This, of course, runs just fine. In a real example, tfidf_all contains 1,428,280 records.
> The reduce output records should be exactly the number of documents, which turns out to be
> 18,863 in this case. All well and good.
> The strangeness comes when you add a SAMPLE command:
> {code:none}
> sampled = SAMPLE vectors 0.0012;
> DUMP sampled;
> {code}
> Running this results in 1,513 reduce output records. The number of reduce output records
> should be much closer to 22 or 23 (i.e. 0.0012 * 18,863, which is about 22.6).
> Evidently, Pig rewrites SAMPLE into a filter and then pushes that filter in front of the
> group, so individual input records are sampled instead of whole groups. It shouldn't push
> that filter, since the UDF is non-deterministic.
> Quick fix: if you add "-t PushUpFilter" to your command line when invoking pig, this won't
> happen.
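
To make the effect described above concrete, here is a small Python simulation. It is
a sketch on synthetic data (2,000 documents with 50 tokens each, sampling probability
0.01), not the reporter's dataset and not Pig code; the function names are made up for
the illustration. It compares keeping each group with probability p after grouping
against filtering individual records with probability p before grouping.

```python
import random
from collections import defaultdict


def sample_after_group(records, p, rng):
    """Group first, then keep each whole group with probability p."""
    groups = defaultdict(list)
    for doc_id, token in records:
        groups[doc_id].append(token)
    return {doc: toks for doc, toks in groups.items() if rng.random() < p}


def sample_before_group(records, p, rng):
    """Keep each record with probability p, then group what is left."""
    groups = defaultdict(list)
    for doc_id, token in records:
        if rng.random() < p:
            groups[doc_id].append(token)
    return dict(groups)


# Synthetic input: 2,000 documents, 50 (doc_id, token) records each.
records = [(doc, tok) for doc in range(2000) for tok in range(50)]
rng = random.Random(0)   # fixed seed so repeated runs give the same result
p = 0.01

after = sample_after_group(records, p, rng)
before = sample_before_group(records, p, rng)

# Sampling after the group keeps about p * 2000 documents, each complete.
# Sampling before the group leaves far more documents, each with a partial vector.
print("groups when sampling after GROUP: ", len(after))
print("groups when sampling before GROUP:", len(before))
```

With these settings the first count should land near 20, while the second should be
several hundred, because a document survives if any one of its 50 records is kept.
The same reasoning is consistent with the 1,513 versus 22 or 23 records reported above.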

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
