From: "Jacob Perkins (JIRA)"
To: pig-dev@hadoop.apache.org
Date: Tue, 26 Apr 2011 14:37:04 +0000 (UTC)
Message-ID: <84798397.2604.1303828624096.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Created] (PIG-2014) SAMPLE shouldn't be pushed up

SAMPLE shouldn't be pushed up
-----------------------------

Key: PIG-2014
URL:
https://issues.apache.org/jira/browse/PIG-2014
Project: Pig
Issue Type: Bug
Reporter: Jacob Perkins

Consider the following code:

{code:none}
tfidf_all = LOAD '$TFIDF' AS (doc_id:chararray, token:chararray, weight:double);
grouped   = GROUP tfidf_all BY doc_id;
vectors   = FOREACH grouped GENERATE group AS doc_id, tfidf_all.(token, weight) AS vector;
DUMP vectors;
{code}

This, of course, runs just fine. In a real example, tfidf_all contains 1,428,280 records. The number of reduce output records should be exactly the number of documents, which turns out to be 18,863 in this case. All well and good.

The strangeness comes when you add a SAMPLE command:

{code:none}
sampled = SAMPLE vectors 0.0012;
DUMP sampled;
{code}

Running this results in 1,513 reduce output records. The count should be much closer to 22 or 23 records (e.g. 0.0012 * 18,863 is about 22.6).

Evidently, Pig rewrites SAMPLE into a filter and then pushes that filter in front of the GROUP. It shouldn't push that filter, since the UDF is non-deterministic: filtering individual records before the GROUP keeps every document that has at least one surviving record, instead of sampling whole documents.

Quick fix: if you add "-t PushUpFilter" to your command line when invoking pig, this won't happen.
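As a rough sanity check on that explanation, the sketch below compares the expected number of output groups under the two plans. It is not Pig code and not part of the original report. It assumes the pushed-up filter keeps each input row independently with probability 0.0012, and it assumes every document has the same number of rows (the mean), because the issue does not give the real distribution. The function names are made up for illustration.

```python
# Figures reported in the issue.
TOTAL_RECORDS = 1_428_280   # rows in tfidf_all
TOTAL_DOCS = 18_863         # distinct doc_id values, i.e. groups
P = 0.0012                  # fraction passed to SAMPLE


def expected_groups_sample_after_group(docs: int, p: float) -> float:
    """Intended plan: each group is kept independently with probability p."""
    return docs * p


def expected_groups_filter_before_group(rows_per_doc, p: float) -> float:
    """Pushed-up plan: a group survives if at least one of its rows passes."""
    return sum(1.0 - (1.0 - p) ** n for n in rows_per_doc)


# ASSUMPTION: every document has the mean number of rows.
# The real per-document row counts are not given in the issue.
mean_rows = TOTAL_RECORDS / TOTAL_DOCS
uniform_rows = [mean_rows] * TOTAL_DOCS

intended = expected_groups_sample_after_group(TOTAL_DOCS, P)
pushed = expected_groups_filter_before_group(uniform_rows, P)

print(f"sample after group:  {intended:.1f} groups expected")
print(f"filter before group: {pushed:.1f} groups expected")
```

Under these assumptions the first figure is about 22.6 and the second is on the order of 1,600, which is the same order of magnitude as the 1,513 records observed. Uneven document lengths would lower the second figure somewhat, so the numbers are at least consistent with the filter having been pushed in front of the GROUP.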