spark-issues mailing list archives

From "Herman van Hovell (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-23599) The UUID() expression is too non-deterministic
Date Mon, 19 Mar 2018 08:43:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-23599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16404511#comment-16404511 ]

Herman van Hovell commented on SPARK-23599:
-------------------------------------------

PR 1 out of 2 has been merged.

> The UUID() expression is too non-deterministic
> ----------------------------------------------
>
>                 Key: SPARK-23599
>                 URL: https://issues.apache.org/jira/browse/SPARK-23599
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Herman van Hovell
>            Assignee: Liang-Chi Hsieh
>            Priority: Critical
>
> The current {{Uuid()}} expression uses {{java.util.UUID.randomUUID}} for UUID generation.
> There are a couple of major problems with this:
> - It is non-deterministic across task retries. This breaks Spark's processing model
>   and leads to very hard-to-trace bugs, such as non-deterministic shuffles, duplicates,
>   and missing rows.
> - It uses a single secure random for UUID generation. This takes a single JVM-wide lock,
>   which can lead to lock contention and other performance problems.
> We should move to something that is deterministic between retries. This can be done by
> using seeded PRNGs for which we set the seed during planning. It is important here to use
> a PRNG that provides enough entropy for creating a proper UUID.
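
The seeded-PRNG idea in the description can be sketched roughly as follows. This is an
illustration only, not the code from either PR: the class names, the {{planSeed}} and
{{partitionIndex}} parameters, and the choice of adding the partition index to the seed are
all assumptions made for the example. {{java.util.SplittableRandom}} stands in for the PRNG
because it is in the JDK and takes no JVM-wide lock; it has only 64 bits of seed state, so
a real implementation would likely want a generator with more entropy, as the description
notes.

```java
import java.util.SplittableRandom;
import java.util.UUID;

public class SeededUuidSketch {

    /** Hypothetical helper: produces version-4-shaped UUIDs from a seeded PRNG. */
    static final class SeededUuidGenerator {
        private final SplittableRandom random;

        SeededUuidGenerator(long planSeed, int partitionIndex) {
            // Assumption: the seed is fixed once at planning time. Adding the
            // partition index makes partitions differ, while a retry of the
            // same partition replays exactly the same sequence.
            this.random = new SplittableRandom(planSeed + partitionIndex);
        }

        UUID next() {
            long msb = random.nextLong();
            long lsb = random.nextLong();
            // Overwrite the 4 version bits with 0100 (version 4, "random").
            msb = (msb & 0xFFFFFFFFFFFF0FFFL) | 0x0000000000004000L;
            // Overwrite the top 2 bits with 10 (IETF variant).
            lsb = (lsb & 0x3FFFFFFFFFFFFFFFL) | 0x8000000000000000L;
            return new UUID(msb, lsb);
        }
    }

    public static void main(String[] args) {
        long planSeed = 42L; // hypothetical seed chosen during planning

        // First attempt and a retry of partition 3 see identical UUIDs.
        SeededUuidGenerator attempt = new SeededUuidGenerator(planSeed, 3);
        SeededUuidGenerator retry = new SeededUuidGenerator(planSeed, 3);
        System.out.println(attempt.next().equals(retry.next())); // prints "true"

        // Each generator instance owns its PRNG, so tasks share no lock.
        System.out.println(new SeededUuidGenerator(planSeed, 4).next());
    }
}
```

Because each task builds its own generator, there is no shared {{SecureRandom}} to contend
on, and a retried task regenerates the same values instead of fresh ones.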



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


