pig-dev mailing list archives

From "Rohini Palaniswamy (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PIG-5040) Order by and CROSS partitioning is not deterministic due to usage of Random
Date Fri, 14 Oct 2016 23:18:21 GMT

    [ https://issues.apache.org/jira/browse/PIG-5040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15576852#comment-15576852 ]

Rohini Palaniswamy commented on PIG-5040:
-----------------------------------------

RandomSampleLoader and POReservoirSample use new Random(), so a rerun will produce different
samples. That is not a problem, because their output is always sent to a single reducer (the
sample aggregator). It would be a problem only if there were more than one reducer.

> Order by and CROSS partitioning is not deterministic due to usage of Random
> ---------------------------------------------------------------------------
>
>                 Key: PIG-5040
>                 URL: https://issues.apache.org/jira/browse/PIG-5040
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Rohini Palaniswamy
>            Assignee: Rohini Palaniswamy
>            Priority: Critical
>             Fix For: 0.17.0, 0.16.1
>
>         Attachments: PIG-5040-1-nowhitespacechanges.patch, PIG-5040-1.patch
>
>
> Maps can be rerun due to shuffle fetch failures. Half of the reducers can end up successfully
> pulling partitions from the first run of the map, while the other half pull from the rerun
> after the shuffle fetch failures. If the Partitioner does not partition the data exactly the
> same way every time, this can lead to incorrect results (lost records and duplicated records).
> Even though the issue has existed for 8 years now with order by and affects MapReduce as well,
> it was found with Tez, where reruns due to shuffle fetch failures are frequent (the order by
> partitioner gets its data from a 1-1 edge, so there are no retries and shuffle fetch failures
> trigger a rerun immediately).
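
The failure mode in the description can be illustrated with a minimal Java sketch. This is
not Pig's partitioner code and not the attached patch: the class PartitionSketch, its method
names, and the four-reducer setup are hypothetical, chosen only for illustration. It models
two runs of the same map, with reducers 0-1 fetching from the first run and reducers 2-3
fetching from the rerun, and counts how many times each record is delivered.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Hypothetical illustration only; not Pig's partitioner implementation.
final class PartitionSketch {
    static final int NUM_REDUCERS = 4;

    // Non-deterministic: the partition comes from an unseeded Random,
    // so a rerun of the map may route the same record elsewhere.
    static List<Integer> runMapRandom(List<String> records) {
        Random rng = new Random();
        List<Integer> partitions = new ArrayList<>();
        for (int i = 0; i < records.size(); i++) {
            partitions.add(rng.nextInt(NUM_REDUCERS));
        }
        return partitions;
    }

    // Deterministic: the partition depends only on the record content,
    // so every run of the map routes a record to the same reducer.
    static List<Integer> runMapStable(List<String> records) {
        List<Integer> partitions = new ArrayList<>();
        for (String record : records) {
            partitions.add(Math.floorMod(record.hashCode(), NUM_REDUCERS));
        }
        return partitions;
    }

    // Reducers in the lower half fetch from the first run, reducers in the
    // upper half fetch from the rerun. Returns, per record, how many
    // reducers received it: 0 means lost, 2 means duplicated.
    static int[] deliveryCounts(List<Integer> firstRun, List<Integer> rerun) {
        int half = NUM_REDUCERS / 2;
        int[] counts = new int[firstRun.size()];
        for (int i = 0; i < counts.length; i++) {
            if (firstRun.get(i) < half) {
                counts[i]++;
            }
            if (rerun.get(i) >= half) {
                counts[i]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("a", "b", "c", "d", "e", "f");

        int[] stable = deliveryCounts(runMapStable(records), runMapStable(records));
        System.out.println("stable: " + Arrays.toString(stable));

        // Output varies from run to run; zeros and twos can appear here.
        int[] random = deliveryCounts(runMapRandom(records), runMapRandom(records));
        System.out.println("random: " + Arrays.toString(random));
    }
}
```

With the stable variant, every record is delivered exactly once regardless of which run a
reducer fetches from. With the random variant, a record can be delivered zero times or twice,
which corresponds to the lost and duplicated records mentioned in the description.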



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
