spark-reviews mailing list archives

From aalobaidi <>
Subject [GitHub] spark issue #21500: Scalable Memory option for HDFSBackedStateStore
Date Mon, 11 Jun 2018 19:51:47 GMT
Github user aalobaidi commented on the issue:
    1. As I mentioned before, this option is beneficial for use cases with bigger micro-batches.
This way, the overhead of loading the state from disk is spread across a large number of
events in each task, so overall throughput is not affected as badly as when the state is loaded
for a small number of events. It is a typical space-time tradeoff: using less memory space
results in an increase in execution time.
    2. Currently, as far as I know, there is no way for Spark to handle state that is bigger
than the total size of the cluster memory. With this option, users can set the number of partitions
(using `spark.sql.shuffle.partitions`) to be, for example, 10 times the total number of cores
in the cluster. The result is that the cluster will load only ~10% of the state at any given
time.
    3. By default, the option is disabled and does not change the current behavior. Also,
enabling and disabling this option doesn't require rebuilding the state, so users will be able
to test it easily and decide whether the performance impact is acceptable.
    4. In future implementations, we could have an executor-wide state manager that evicts
state partitions from memory only when needed.
    Also, I agree that a better solution should be developed, maybe using RocksDB. But for
the time being, and for this store implementation, this option enables one extra use case with
very little code to maintain.
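    As a rough illustration of the sizing arithmetic in point 2, here is a minimal sketch (the cluster size and the 10x multiplier are hypothetical figures, not values from the PR):

```python
# Sketch of the partition-sizing arithmetic from point 2.
# Assumption: with the option enabled, each running task loads only its own
# state partition from disk, so at most (total_cores / num_partitions) of
# the total state is resident in memory at any given time.

total_cores = 40                           # hypothetical cluster total
multiplier = 10                            # e.g. spark.sql.shuffle.partitions = 10x cores
num_partitions = multiplier * total_cores  # value for spark.sql.shuffle.partitions

# Fraction of state partitions loaded concurrently across the cluster:
resident_fraction = total_cores / num_partitions
print(resident_fraction)  # 0.1, i.e. ~10% of the state in memory
```

    Raising the multiplier trades more per-task load overhead for a smaller resident fraction, which is the space-time tradeoff described in point 1.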

