spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Apache Spark (JIRA)" <>
Subject [jira] [Assigned] (SPARK-14289) Add support to multiple eviction strategies for cached RDD partitions
Date Tue, 05 Apr 2016 01:20:25 GMT


Apache Spark reassigned SPARK-14289:

    Assignee: Apache Spark

> Add support to multiple eviction strategies for cached RDD partitions
> ---------------------------------------------------------------------
>                 Key: SPARK-14289
>                 URL:
>             Project: Spark
>          Issue Type: New Feature
>          Components: Block Manager, Spark Core
>         Environment: Spark 2.0-SNAPSHOT
> Single Rack
> Standalone mode scheduling
> 8 node cluster
> 16 cores & 64G RAM / node
> Data Replication factor of 3
> Each Node has 1 Spark executors configured with 16 cores each and 40GB of RAM.
>            Reporter: Yuanzhen Geng
>            Assignee: Apache Spark
>            Priority: Minor
> Currently, there is only eviction strategy for cached RDD partition in Spark. 
> The default RDD eviction strategy is LRU (with an additional rule that do not replacing
another block that belongs to the same RDD like current creating partition).
> When memory space not sufficient for RDD caching, several partitions will be evicted,
if these partitions are used again latterly, they will be reproduced by the Lineage information
and cached in memory again. The reproduce phase will bring in additional cost. However, LRU
has no guarantee for the lowest reproduce cost. 
> The first RDD that needed to be cached is usually generated by reading from HDFS and
doing several transformations. The reading operation usually cost longer time than other Spark
> For example, in one stage we having the following DAG structure: hdfs -> \[A\] ->
B -> \[C\] -> D - > \[E\] -> \[F\], RDD A, C, E, F needed to be cached in memory,
F is creating during this stage while A, B and E had already been created in previous. When
using the LRU eviction strategy, partition of A will be evicted first. However, the time cost
in\ [A\] -> B -> \[C\] may be much less than hdfs ->\ [A\], so evict \[C\] may be
better than evict \[A\]. 
> A eviction strategy based on the creation cost may be better than LRU, by statisticing
each transformation's time during the creation of cached RDD partition (e.g. \[E\] only need
to statistic time cost in \[C\] -> D and D -> \[E\]) and time cost in needed shuffle
reading. When memory for RDD storage not sufficient, partition with the least creation cost
may be evicted first. So this strategy for be called as LCS. My current demo show better performance
gain than default LRU.
> This strategy needs to consider the following situation:
> 1. Unified Memory Management is provided after Spark 1.6, memory for execution during
recomputing a partition may be pretty different than the first time the partition created.
So before better thought, LCS may not be allowed in UMM mode. (Though my demo also show improvement
in LCS than LRU in UMM mode).
> 2. MEMORY_AND_DISK_SER or other similar storage level may serialize RDD partition. By
estimating ser/deserialize cost and compare to creation cost, if the ser/deserialize cost
even larger than recreation, not serialize but directly removed from memory. As existing storage
level only allowed for the whole RDD, so a new storage level may be needed for RDD partition
to directly determine whether to serialize or just remove from memory.
> Besides LCS, FIFO or LFU is easy to be implemented.

This message was sent by Atlassian JIRA

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message