pig-dev mailing list archives

From "Daniel Dai (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (PIG-4175) PIG CROSS operation follow by STORE produces non-deterministic results each run
Date Sun, 21 Sep 2014 19:13:36 GMT

     [ https://issues.apache.org/jira/browse/PIG-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-4175:
----------------------------
    Attachment: PIG-4175-1.patch

Sure. In the meantime, I tried the script with Pig 0.14 and it produces the right result. However,
we can do better, since the cross uses only 1 reduce task. I shall use Rohini's suggestion: "One
way to fix this would be to always have GFCross UDF as part of map task of the actual cross
job and never do it as part of previous job's map or reduce." Patch attached.
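
Why the placement of GFCross could matter is not spelled out in this message. As a rough illustration only, the Python sketch below models a grid-style cross in which each side tags its records with grid keys and records with matching keys are grouped afterward. This is a simplified model and not Pig's actual GFCross code; the function names, the grid parameters, and the premise that the two sides could end up with different grid sizes when tagged in different jobs are all assumptions made for the example.

```python
import random


def tag(records, side, grid, rng):
    """Tag each record with grid keys: a random slot on its own axis,
    replicated across every slot on the other axis (hypothetical model)."""
    out = []
    for rec in records:
        own = rng.randrange(grid)
        for other in range(grid):
            key = (own, other) if side == 0 else (other, own)
            out.append((key, side, rec))
    return out


def cross_via_grid(left, right, grid_left, grid_right, seed=0):
    """Group tagged records by key and pair left with right inside each cell."""
    rng = random.Random(seed)
    groups = {}
    tagged = tag(left, 0, grid_left, rng) + tag(right, 1, grid_right, rng)
    for key, side, rec in tagged:
        groups.setdefault(key, ([], []))[side].append(rec)
    return [(l, r) for ls, rs in groups.values() for l in ls for r in rs]


# Hypothetical data: 100 left records crossed with a single right record.
left = [(i,) for i in range(100)]
right = [("count",)]

matched = cross_via_grid(left, right, grid_left=4, grid_right=4)
mismatched = cross_via_grid(left, right, grid_left=4, grid_right=2)
print(len(matched), len(mismatched))
```

In this model, equal grid sizes make every pair meet in exactly one cell, so the matched run yields all 100 pairs. With unequal grid sizes some records never find a partner, which would resemble the partial output described in the report.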

> PIG CROSS operation follow by STORE produces non-deterministic results each run
> -------------------------------------------------------------------------------
>
>                 Key: PIG-4175
>                 URL: https://issues.apache.org/jira/browse/PIG-4175
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.11, 0.12.0
>         Environment: RHEL 6/64-bit
>            Reporter: Jim Huang
>         Attachments: PIG-4175-1.patch, mktestdata.py, pig_testcross_plan.png, test_cross.out, test_cross.pig
>
>
> Three files will be attached to help visualize this issue.
> 1. mktestdata.py - to generate test data to feed the pig script
> 2. test_cross.pig - the PIG script using CROSS and STORE
> 3. test_cross.out - the PIG console output showing the input/output records delta
> To reproduce this PIG CROSS operation problem, you need to use the supplied Python script,
> mktestdata.py, to generate an input file that is at least 13,948,228,930 bytes (> 13GB).
> The CROSS between raw_data (m records) and cross_count (1 record) should yield exactly
> m records as the output.
> The STORE results from the CROSS operations yielded about 1/3 of the input records in raw_data
> as the output.
> If I joined both of the CROSS operations together, the STORE results from the CROSS
> operations yielded about 2/3 of the input records in raw_data as the output.
> -- data = CROSS raw_data, field04s_count, subsection1_field04s_count, subsection2_field04s_count;
> We have reproduced this using both Pig 0.11 (Hadoop 1.x) and Pig 0.12 (Hadoop 2.x) clusters.
> The default HDFS block size is 128MB.
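
The expectation stated in the report (m records crossed with 1 record should yield exactly m records) can be checked on a small scale. The following is a minimal Python sketch of that cardinality check. The relation names come from the report, but the record contents and the cross helper are hypothetical stand-ins, not the output of mktestdata.py or Pig's implementation.

```python
from itertools import product


def cross(left, right):
    """Return the Cartesian product of two record lists as flat tuples."""
    return [l + r for l, r in product(left, right)]


# Hypothetical stand-ins: raw_data has m records, cross_count has 1 record.
raw_data = [(i, f"field{i}") for i in range(1000)]
cross_count = [(len(raw_data),)]

result = cross(raw_data, cross_count)

# A correct CROSS yields len(left) * len(right) records, here m * 1 = m.
assert len(result) == len(raw_data) * len(cross_count)
```

A stored output holding only about 1/3 or 2/3 of m records, as reported, fails this check.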



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
