Date: Sun, 21 Sep 2014 19:13:36 +0000 (UTC)
From: "Daniel Dai (JIRA)"
To: pig-dev@hadoop.apache.org
Reply-To: dev@pig.apache.org
Subject: [jira] [Updated] (PIG-4175) PIG CROSS operation follow by STORE produces non-deterministic results each run

     [ https://issues.apache.org/jira/browse/PIG-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-4175:
----------------------------
    Attachment: PIG-4175-1.patch

Sure. In the meantime, I tried the script with Pig 0.14 and it produces the right result. However, we can do better, since cross is using only 1 reducer. I will use Rohini's suggestion: "One way to fix this would be to always have GFCross UDF as part of map task of the actual cross job and never do it as part of previous job's map or reduce." Patch attached.
> PIG CROSS operation follow by STORE produces non-deterministic results each run
> -------------------------------------------------------------------------------
>
>                 Key: PIG-4175
>                 URL: https://issues.apache.org/jira/browse/PIG-4175
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.11, 0.12.0
>         Environment: RHEL 6/64-bit
>            Reporter: Jim Huang
>         Attachments: PIG-4175-1.patch, mktestdata.py, pig_testcross_plan.png, test_cross.out, test_cross.pig
>
>
> Three files will be attached to help visualize this issue.
> 1. mktestdata.py - to generate test data to feed the pig script
> 2. test_cross.pig - the PIG script using CROSS and STORE
> 3. test_cross.out - the PIG console output showing the input/output records delta
> To reproduce this PIG CROSS operation problem, you need to use the supplied Python script,
> mktestdata.py, to generate an input file that is at least 13,948,228,930 bytes (> 13GB).
> The CROSS between raw_data (m records) and cross_count (1 record) should yield exactly m records as the output.
> The STORE results from the CROSS operations yielded about 1/3 of the input records in raw_data as the output.
> If I joined both of the CROSS operations together, the STORE results from the CROSS operations yielded about 2/3
> of the input records in raw_data as the output.
> -- data = CROSS raw_data, field04s_count, subsection1_field04s_count, subsection2_field04s_count;
> We have reproduced this using both Pig 0.11 (Hadoop 1.x) and Pig 0.12 (Hadoop 2.x) clusters.
> The default HDFS block size is 128MB.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
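
The record-count expectation in the report (m records crossed with a single-record relation gives exactly m records) can be illustrated with a small in-memory sketch. This is not Pig's implementation and does not involve GFCross. The helpers `cross` and `expected_cross_count` and the sample rows are hypothetical, and the tiny relations stand in for the 13GB input only to show the cardinality a correct CROSS should produce.

```python
from itertools import product


def cross(*relations):
    """In-memory Cartesian product; each output row concatenates one row per relation."""
    return [sum(combo, ()) for combo in product(*relations)]


def expected_cross_count(*relations):
    """A correct CROSS yields the product of the input relation sizes."""
    count = 1
    for relation in relations:
        count *= len(relation)
    return count


# Hypothetical stand-ins: raw_data has m = 5 records, cross_count has exactly 1.
raw_data = [("r%d" % i, i) for i in range(5)]
cross_count = [(5,)]

result = cross(raw_data, cross_count)

# m records crossed with 1 record must give m records, on every run.
assert len(result) == expected_cross_count(raw_data, cross_count) == len(raw_data)
```

The same arithmetic applies to the four-way CROSS in the commented-out line of the script: with three single-record relations, the expected output is still m records, so an output of roughly 1/3 or 2/3 of the input indicates dropped rows rather than a legitimate result.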