flink-user mailing list archives

From: Stefan Richter <s.rich...@ververica.com>
Subject: Re: [DISCUSS] Proposal to support disk spilling in HeapKeyedStateBackend
Date: Wed, 29 May 2019 08:33:43 GMT
Hi Yu,

Sorry for the late reaction. As already discussed internally, I think this is a very good proposal and design that can help to overcome a major limitation of the current state backend. I think most of the discussion is happening in the design doc, and I have left my comments there. Looking forward to seeing this integrated with Flink soon!

Best,
Stefan

> On 24. May 2019, at 14:50, Yu Li <carp84@gmail.com> wrote:
> 
> Hi All,
> 
> As mentioned in our talk[1] given at Flink Forward China 2018, we have improved HeapKeyedStateBackend to support disk spilling and put it into production at Alibaba for last year's Singles' Day. We are now ready to upstream our work, and the design doc is up for review[2]. Please let us know your opinion of the feature; any comment is welcome and appreciated.
> 
> We plan to keep the discussion open for at least 72 hours, and will create the umbrella JIRA and subtasks if there are no objections. Thanks.
> 
> Below is a brief description of the motivation for this work, FYI:
> 
> HeapKeyedStateBackend is one of the two KeyedStateBackends in Flink. Since state lives as Java objects on the heap and de/serialization only happens during state snapshot and restore, it outperforms RocksDBKeyedStateBackend when all data can reside in memory.
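> 
> For readers less familiar with the two backends, here is a minimal sketch of how a job chooses between them today, to make the trade-off above concrete (the checkpoint URI is just a placeholder):
> 
>     import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
>     import org.apache.flink.runtime.state.filesystem.FsStateBackend;
>     import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
> 
>     public class BackendChoice {
>         public static void main(String[] args) throws Exception {
>             StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
> 
>             // Heap-based keyed state: objects live on the JVM heap and are only
>             // de/serialized for snapshot and restore, so access is fast but the
>             // total state size is bounded by Xmx.
>             env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"));
> 
>             // RocksDB-based keyed state: de/serialized on every access, but the
>             // state size is bounded by local disk rather than the heap.
>             // env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints"));
>         }
>     }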
> However, along with this advantage, HeapKeyedStateBackend also has its shortcomings, and the most painful one is the difficulty of estimating the maximum heap size (Xmx) to set; once the heap memory is not enough to hold all state data, we suffer from GC impact. There are several (inevitable) causes for such a scenario, including (but not limited to):
> * Memory overhead of the Java object representation (often tens of times the serialized data size).
> * Data floods caused by burst traffic.
> * Data accumulation caused by source malfunction.
> To resolve this problem, we propose a solution that supports spilling state data to disk before heap memory is exhausted. We will monitor the heap usage, choose the coldest data to spill, and automatically reload it once heap memory is regained after data removal or TTL expiration.
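> 
> To make the intended behaviour concrete, here is a rough sketch of the monitor-and-spill idea; the class and method names below are invented for illustration only, and the actual interfaces are described in the design doc[2]:
> 
>     import java.lang.management.ManagementFactory;
>     import java.lang.management.MemoryMXBean;
>     import java.lang.management.MemoryUsage;
> 
>     // Illustrative only: HybridStateTable is a placeholder, not a real Flink class.
>     public class SpillMonitorSketch {
>         private static final double SPILL_THRESHOLD = 0.85; // spill when heap usage exceeds 85%
>         private static final double LOAD_THRESHOLD = 0.50;  // reload when heap usage drops below 50%
> 
>         interface HybridStateTable {
>             void spillColdest(); // move the least recently used key groups to disk
>             void loadSpilled();  // bring spilled key groups back onto the heap
>         }
> 
>         void checkAndAdjust(HybridStateTable table) {
>             MemoryMXBean memoryBean = ManagementFactory.getMemoryMXBean();
>             MemoryUsage heap = memoryBean.getHeapMemoryUsage();
>             double usage = (double) heap.getUsed() / heap.getMax();
> 
>             if (usage > SPILL_THRESHOLD) {
>                 // Heap is nearly full: spill the coldest data before GC pressure builds up.
>                 table.spillColdest();
>             } else if (usage < LOAD_THRESHOLD) {
>                 // Heap memory has been regained (e.g. after state removal or TTL
>                 // expiration): load spilled data back for faster access.
>                 table.loadSpilled();
>             }
>         }
>     }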
> 
> [1] https://files.alicdn.com/tpsservice/1df9ccb8a7b6b2782a558d3c32d40c19.pdf
> [2] https://docs.google.com/document/d/1rtWQjIQ-tYWt0lTkZYdqTM6gQUleV8AXrfTOyWUZMf4/edit?usp=sharing
> Best Regards,
> Yu

