spark-reviews mailing list archives

From JoshRosen <>
Subject [GitHub] spark pull request #16989: [SPARK-19659] Fetch big blocks to disk when shuff...
Date Thu, 18 May 2017 07:20:50 GMT
Github user JoshRosen commented on a diff in the pull request:
    --- Diff: core/src/main/scala/org/apache/spark/internal/config/package.scala ---
    @@ -278,4 +278,21 @@ package object config {
    +  private[spark] val SHUFFLE_ACCURATE_BLOCK_THRESHOLD =
    +    ConfigBuilder("spark.shuffle.accurateBlkThreshold")
    +      .doc("When we compress the size of shuffle blocks in HighlyCompressedMapStatus,
we will " +
    +        "record the size accurately if it's above the threshold specified by this config.
This " +
    --- End diff --
    One edge case to consider is the situation where every shuffle block is _just_ over this threshold: in that case `HighlyCompressedMapStatus` won't really be doing any compression.
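    Concretely (rough numbers for illustration, not measurements): with `M` map tasks and `R` reduce partitions where every block lands just above the threshold, each map status would record all `R` sizes exactly, so the driver ends up holding on the order of `M * R` exact sizes, e.g. 10,000 maps times 10,000 reducers is ~10^8 recorded sizes, instead of a single average size per map task.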
    Does it make sense to compare against the average block size and record exact sizes only for blocks that are more than some percentage or factor above it? The number of such blocks will probably be smaller, and this might help avoid worst-case behavior or excessive bloating of the map output status sizes if someone were to set this configuration too low. A sketch of what I mean follows.
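    To make the suggestion concrete, here is a minimal sketch (a hypothetical helper, not the PR's actual implementation; `sizes` is assumed to hold the uncompressed per-block sizes for one map task, and `factor` is the proposed multiple-of-average cutoff):

        def blocksToRecordExactly(sizes: Array[Long], factor: Double): Map[Int, Long] = {
          val nonEmpty = sizes.filter(_ > 0)
          if (nonEmpty.isEmpty) {
            Map.empty
          } else {
            val avg = nonEmpty.sum.toDouble / nonEmpty.length
            // Only blocks well above the average get their exact size recorded;
            // everything else falls back to the single compressed average, so a
            // uniformly large shuffle no longer defeats the compression.
            sizes.zipWithIndex
              .collect { case (size, i) if size > avg * factor => i -> size }
              .toMap
          }
        }

    With that shape, a too-low `factor` only degrades toward recording everything, while a typically skewed shuffle records just the outlier blocks.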
