spark-commits mailing list archives

Subject [2/2] git commit: Merge pull request #409 from tdas/unpersist
Date Tue, 14 Jan 2014 06:29:47 GMT
Merge pull request #409 from tdas/unpersist

Automatically unpersisting RDDs that have been cleaned up from DStreams

Earlier, RDDs generated by DStreams were forgotten but not unpersisted; the system relied on
the BlockManager's natural LRU eviction to drop the data. The cleaner.ttl was a hammer for
cleaning up RDDs, but it has to be set separately and very conservatively (at best, a few
minutes). Automatic unpersisting lets the system handle this itself, which reduces memory
usage. As a side effect it also improves GC performance, since fewer objects are stored in
memory. In fact, for some workloads it may allow RDDs to be cached deserialized, which speeds
up processing without much GC overhead.

This is disabled by default. To enable it, set the configuration spark.streaming.unpersist to true.
In a future release, this will be set to true by default.
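A minimal sketch of enabling the flag from application code, assuming a Spark Streaming app of this era; the app name, master URL, and batch interval below are illustrative, not part of the commit:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Opt in to automatic unpersisting of RDDs that DStreams have forgotten.
// Disabled by default at the time of this commit.
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("UnpersistExample")
  .set("spark.streaming.unpersist", "true")

val ssc = new StreamingContext(conf, Seconds(1))
```

The same key can also be passed via spark-submit style system properties instead of hard-coding it in the application.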

Also, reduced the sleep time in TaskSchedulerImpl.stop() from 5 seconds to 1 second. From my conversation
with Matei, there does not seem to be any good reason for the sleep (which lets outstanding messages
be sent out) to be so long.


Branch: refs/heads/master
Commit: 08b9fec93d00ff0ebb49af4d9ac72d2806eded02
Parents: b07bc02 27311b1
Author: Patrick Wendell <>
Authored: Mon Jan 13 22:29:03 2014 -0800
Committer: Patrick Wendell <>
Committed: Mon Jan 13 22:29:03 2014 -0800

 .../spark/scheduler/TaskSchedulerImpl.scala     |  5 +-
 .../streaming/api/java/JavaDStreamLike.scala    |  3 +-
 .../spark/streaming/dstream/DStream.scala       | 11 ++-
 .../dstream/DStreamCheckpointData.scala         |  2 +-
 .../spark/streaming/BasicOperationsSuite.scala  | 72 +++++++++++++-------
 5 files changed, 63 insertions(+), 30 deletions(-)
