spark-reviews mailing list archives

From tdas <...@git.apache.org>
Subject [GitHub] spark pull request: [SPARK-6128][Streaming][Documentation] Updates...
Date Tue, 10 Mar 2015 00:11:02 GMT
Github user tdas commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4956#discussion_r26089297
  
    --- Diff: docs/streaming-programming-guide.md ---
    @@ -1801,40 +1894,40 @@ temporary data rate increases may be fine as long as the delay reduces back to a
     
     ## Memory Tuning
     Tuning the memory usage and GC behavior of Spark applications has been discussed in great detail
    -in the [Tuning Guide](tuning.html). It is recommended that you read that. In this section,
    -we highlight a few customizations that are strongly recommended to minimize GC related pauses
    -in Spark Streaming applications and achieving more consistent batch processing times.
    -
    -* **Default persistence level of DStreams**: Unlike RDDs, the default persistence level of DStreams
    -serializes the data in memory (that is,
    -[StorageLevel.MEMORY_ONLY_SER](api/scala/index.html#org.apache.spark.storage.StorageLevel$) for
    -DStream compared to
    -[StorageLevel.MEMORY_ONLY](api/scala/index.html#org.apache.spark.storage.StorageLevel$) for RDDs).
    -Even though keeping the data serialized incurs higher serialization/deserialization overheads,
    -it significantly reduces GC pauses.
    -
    -* **Clearing persistent RDDs**: By default, all persistent RDDs generated by Spark Streaming will
    - be cleared from memory based on Spark's built-in policy (LRU). If `spark.cleaner.ttl` is set,
    - then persistent RDDs that are older than that value are periodically cleared. As mentioned
    - [earlier](#operation), this needs to be careful set based on operations used in the Spark
    - Streaming program. However, a smarter unpersisting of RDDs can be enabled by setting the
    - [configuration property](configuration.html#spark-properties) `spark.streaming.unpersist` to
    - `true`. This makes the system to figure out which RDDs are not necessary to be kept around and
    - unpersists them. This is likely to reduce
    - the RDD memory usage of Spark, potentially improving GC behavior as well.
    -
    -* **Concurrent garbage collector**: Using the concurrent mark-and-sweep GC further
    -minimizes the variability of GC pauses. Even though concurrent GC is known to reduce the
    +in the [Tuning Guide](tuning.html#memory-tuning). It is strongly recommended that you read that. In this section, we discuss a few tuning parameters specifically in the context of Spark Streaming applications.
    +
    +The amount of cluster memory required by a Spark Streaming application depends heavily on the type of transformations used. For example, if you want to use a window operation on the last 10 minutes of data, then your cluster should have sufficient memory to hold 10 minutes' worth of data in memory. Or if you want to use `updateStateByKey` with a large number of keys, then the necessary memory will be high. In contrast, if you want to do a simple map-filter-store operation, then the necessary memory will be low.
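As a rough illustration of the window case above (the source, port, and durations are hypothetical, and running this requires a Spark deployment), the following sketch forces Spark Streaming to retain the last 10 minutes of received data in memory:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

object WindowMemorySketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WindowMemorySketch")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Hypothetical text source; the last 10 minutes of these lines
    // must fit in cluster memory for the window operation below.
    val lines = ssc.socketTextStream("localhost", 9999)

    // Count words over a sliding 10-minute window, re-evaluated every 30 seconds.
    val wordCounts = lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Minutes(10), Seconds(30))

    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```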
    +
    +In general, since the data received through receivers is stored with StorageLevel.MEMORY_AND_DISK_SER_2, the data that does not fit in memory will spill over to the disk. This may reduce the performance of the streaming application, and hence it is advised to provide sufficient memory as required by your streaming application. It's best to try and see the memory usage on a small scale and estimate accordingly.
    +
    +Another aspect of memory tuning is garbage collection. For streaming applications that require low latency, it is undesirable to have large pauses caused by JVM garbage collection.
    +
    +There are a few parameters that can help you tune the memory usage and GC overheads.
    +
    +* **Persistence Level of DStreams**: As mentioned earlier in the [Data Serialization](#data-serialization) section, the input data and RDDs are by default persisted as serialized bytes. This reduces both the memory usage and the GC overheads, compared to deserialized persistence. Enabling Kryo serialization further reduces serialized sizes and memory usage. Further reduction in memory usage through compression (see the Spark configuration `spark.rdd.compress`) comes at the cost of CPU time.
    +
    +* **Clearing old data**: By default, all input data and persisted RDDs generated by DStream transformations are automatically cleared. Spark Streaming decides when to clear the data based on the transformations that are used. For example, if you are using a window operation of 10 minutes, then Spark Streaming will keep around the last 10 minutes of data, and actively throw away older data.
    +Data can be retained for a longer duration (e.g. for interactively querying older data) by setting `streamingContext.remember`.
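A minimal sketch of the `remember` setting (the 20-minute value is only an illustrative choice, and this assumes an already-constructed `StreamingContext`):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

val conf = new SparkConf().setAppName("RememberSketch")
val ssc = new StreamingContext(conf, Seconds(10))

// Keep generated RDDs around for at least 20 minutes instead of clearing
// them as soon as the transformations no longer need them, so that older
// batch RDDs remain available for interactive querying.
ssc.remember(Minutes(20))
```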
    +
    +* **CMS Garbage Collector**: Use of the concurrent mark-and-sweep GC is strongly recommended for keeping GC-related pauses consistently low. Even though concurrent GC is known to reduce the
     overall processing throughput of the system, its use is still recommended to achieve more
    -consistent batch processing times.
    +consistent batch processing times. Make sure you set the CMS GC on both the driver (using `--driver-java-options` in `spark-submit`) and the executors (using [Spark configuration](configuration.html#runtime-environment) `spark.executor.extraJavaOptions`).
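For example, a `spark-submit` invocation along these lines (the application class and jar names are placeholders) enables CMS on both the driver and the executors:

```shell
spark-submit \
  --driver-java-options "-XX:+UseConcMarkSweepGC" \
  --conf "spark.executor.extraJavaOptions=-XX:+UseConcMarkSweepGC" \
  --class com.example.MyStreamingApp \
  my-streaming-app.jar
```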
    +
    +* **Other tips**: To further reduce GC overheads, here are some more tips to try.
    +    - Use Tachyon for off-heap storage of persisted RDDs. See more details in the [Spark Programming Guide](programming-guide.html#rdd-persistence).
    +    - Use more executors with smaller heap sizes. This will reduce the GC pressure within each JVM heap.
    +
     
     ***************************************************************************************************
     ***************************************************************************************************
     
     # Fault-tolerance Semantics
     In this section, we will discuss the behavior of Spark Streaming applications in the event
    -of node failures. To understand this, let us remember the basic fault-tolerance semantics of
    -Spark's RDDs.
    +of failures.
    +
    +## Background
    +{:.no_toc}
    +To understand the semantics provided by Spark Streaming, let us remember the basic fault-tolerance semantics of Spark's RDDs.
    --- End diff --
    
    Done.



