spark-reviews mailing list archives

From cloud-fan <>
Subject [GitHub] spark pull request #22112: [WIP][SPARK-23243][Core] Fix RDD.repartition() da...
Date Wed, 15 Aug 2018 20:02:52 GMT
GitHub user cloud-fan opened a pull request:

    [WIP][SPARK-23243][Core] Fix RDD.repartition() data correctness issue

    ## What changes were proposed in this pull request?
    An alternative fix for
    An RDD can take an arbitrary user function, but we make an assumption: the function must produce the same data set for the same input, though the output order may vary.
    The Spark scheduler must uphold this assumption when a fetch failure happens; otherwise we may hit the correctness issue described in the JIRA ticket.
    Generally speaking, when a map stage gets retried because of a fetch failure, and this map stage is not idempotent (it produces the same data set but in a different order each time), and the shuffle partitioner is sensitive to the input data order (like the round-robin partitioner), we should retry all the reduce tasks.
    TODO: document and test

You can merge this pull request into a Git repository by running:

    $ git pull repartition

Alternatively you can review and apply these changes as the patch at:

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22112
commit 1f9f6e5b020038be1e7c11b9923010465da385aa
Author: Wenchen Fan <wenchen@...>
Date:   2018-08-15T18:38:24Z

    fix repartition+shuffle bug



