spark-issues mailing list archives

From "Jungtaek Lim (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-24441) Expose total size of states in HDFSBackedStateStoreProvider
Date Sun, 03 Jun 2018 04:28:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-24441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jungtaek Lim updated SPARK-24441:
---------------------------------
    Description: 
While Spark exposes state metrics for a single state, it still doesn't expose the overall memory
usage of state (loadedMaps) in HDFSBackedStateStoreProvider.

The rationale for the patch is that state backed by HDFSBackedStateStoreProvider will consume
more memory than the figure we can get from query status, because the provider caches multiple
versions of state. The memory footprint can be much larger than query status reports in situations
where the state store is getting a lot of updates: shallow-copying the map adds only a small
amount of memory (map entries and references) as long as row objects are shared across versions,
but if there are lots of updates between batches, fewer row objects are shared, and more row
objects exist in memory, consuming much more memory than we expect.
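A minimal sketch of the sharing argument above (plain Python with hypothetical names, not Spark's actual HDFSBackedStateStoreProvider code): each new version shallow-copies the previous map, so unchanged rows stay shared, while every updated key gets a fresh row object that only the newer versions hold. The number of distinct row objects alive across all cached versions, and hence the unreported memory, grows with the update rate.

```python
# Sketch: shallow-copied "state versions" share row objects only for
# keys that were NOT updated between batches. Hypothetical model, not
# Spark's implementation.

def new_version(prev, updates):
    # Shallow copy: small cost (map entries + references); unchanged
    # rows remain the same objects as in the previous version.
    version = dict(prev)
    for key, row in updates.items():
        version[key] = row  # fresh row object, not shared with older versions
    return version

v1 = {k: bytearray(100) for k in range(1000)}   # 1000 rows of ~100 bytes

# Few updates between batches: almost every row is shared with v1.
v2 = new_version(v1, {0: bytearray(100)})
shared = sum(1 for k in v1 if v2[k] is v1[k])
assert shared == 999

# Many updates between batches: little sharing is left.
v3 = new_version(v2, {k: bytearray(100) for k in range(900)})
shared = sum(1 for k in v2 if v3[k] is v2[k])
assert shared == 100

# Distinct row objects held across all cached versions -- this is the
# memory that per-state query status does not account for.
distinct_rows = {id(r) for v in (v1, v2, v3) for r in v.values()}
print(len(distinct_rows))  # -> 1901 (1000 + 1 + 900)
```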

It would be better to expose it as well so that end users can determine the actual memory
usage for state.

  was:
While Spark exposes state metrics for a single state, it still doesn't expose the overall memory
usage of state (loadedMaps) in HDFSBackedStateStoreProvider.

Since HDFSBackedStateStoreProvider caches multiple versions of the entire state in a hashmap,
this can occupy much more memory than a single version of the state. With the default value of minVersionsToRetain,
the cache map can grow to more than 100 times the size of a single state. It would
be better to expose it as well so that end users can determine the actual memory usage for state.
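A back-of-envelope bound for the "more than 100 times" claim above (assumptions: the retention default of 100 comes from spark.sql.streaming.minBatchesToRetain in Spark 2.3, and the worst case is zero sharing between versions, i.e. every row is updated each batch; the state size is hypothetical):

```python
# Worst-case cache size when no row objects are shared across the
# retained versions. Numbers are illustrative, not measurements.

single_state_mb = 50          # hypothetical size of one state version
min_versions_to_retain = 100  # Spark 2.3 default (spark.sql.streaming.minBatchesToRetain)
shared_fraction = 0.0         # worst case: nothing shared between versions

worst_case_mb = single_state_mb * (
    1 + (min_versions_to_retain - 1) * (1 - shared_fraction))
print(worst_case_mb)  # -> 5000.0, i.e. 100x a single version
```

With high sharing (shared_fraction near 1.0) the cache stays close to one state's size, which is why the gap only shows up under heavy update rates.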


> Expose total size of states in HDFSBackedStateStoreProvider
> -----------------------------------------------------------
>
>                 Key: SPARK-24441
>                 URL: https://issues.apache.org/jira/browse/SPARK-24441
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 2.3.0
>            Reporter: Jungtaek Lim
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

