spark-issues mailing list archives

From "Matei Zaharia (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-3098) In some cases, the zipWithIndex operation returns wrong results
Date Mon, 01 Sep 2014 18:25:21 GMT

[ https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117622#comment-14117622 ]

Matei Zaharia commented on SPARK-3098:
--------------------------------------

It's true that the ordering of values after a shuffle is nondeterministic, so, for example,
after a failure you might see a different order of keys from a reduceByKey, a distinct, or
similar operations. However, I think that's the way it should be (and we can document it).
RDDs are deterministic when viewed as a multiset, but not when viewed as an ordered collection,
unless you do a sortByKey. Operations like zipWithIndex are meant more as a convenience to get
unique IDs or to act on data with a known ordering (such as a text file where you want to know
the line numbers). But the freedom to control fetch ordering is quite important for performance,
especially if you want to have a push-based shuffle in the future.
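
For illustration only (a minimal user-level sketch, not something the API guarantees): if a job
needs reproducible indices, it can impose an order itself with sortBy before zipWithIndex,
assuming the values have a total ordering. The variable names below are just placeholders.

{code}
// Sketch: sorting first pins the ordering, so the indices produced by zipWithIndex
// are reproducible regardless of shuffle fetch order.
// Assumes "sc" is an existing SparkContext, as in the reproducer quoted below.
val data = sc.parallelize(1 to 7899).flatMap { i =>
  (1 to 10000).map(p => i * 6000 + p)
}
val withStableIds = data.distinct().sortBy(identity).zipWithIndex()
{code}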

If we wanted to get the same result every time, we could design reduce tasks to tell the master
the order in which they fetched blocks the first time they ran, but even then, note that it might
limit the kinds of shuffle mechanisms we can allow (e.g., it would be harder to make a push-based
shuffle deterministic). I'd rather not make that guarantee now.
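
In the meantime, a user-level sketch of how a job like the one quoted below can pin a
zipWithIndex result for reuse (assuming there is room to materialize it, and noting that lost
cached blocks would still be recomputed, possibly with different indices):

{code}
import org.apache.spark.SparkContext._   // pair-RDD implicits (pre-1.3 style)
import org.apache.spark.storage.StorageLevel

// Sketch: materialize the indexed RDD once so both sides of the self-join read the
// same cached data instead of re-running the shuffle, whose fetch order may differ.
val c = sc.parallelize(1 to 7899).flatMap { i =>
  (1 to 10000).map(p => i * 6000 + p)
}.distinct().zipWithIndex()

c.persist(StorageLevel.MEMORY_AND_DISK)
c.count()  // force evaluation so the join reads the cached copy

// With the cached copy, no mismatched indices are expected
// (barring eviction or executor loss, which trigger recomputation).
c.join(c).filter(t => t._2._1 != t._2._2).take(3)
{code}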

>  In some cases, the zipWithIndex operation returns wrong results
> ------------------------------------------------------------------
>
>                 Key: SPARK-3098
>                 URL: https://issues.apache.org/jira/browse/SPARK-3098
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.0.1
>            Reporter: Guoqiang Li
>            Priority: Critical
>
> Code to reproduce:
> {code}
> val c = sc.parallelize(1 to 7899).flatMap { i =>
>   (1 to 10000).toSeq.map(p => i * 6000 + p)
> }.distinct().zipWithIndex()
> c.join(c).filter(t => t._2._1 != t._2._2).take(3)
> {code}
>  => 
> {code}
>  Array[(Int, (Long, Long))] = Array((1732608,(11,12)), (45515264,(12,13)), (36579712,(13,14)))
> {code}


