flink-dev mailing list archives

From Stephan Ewen <se...@apache.org>
Subject [ANNOUNCE] Stability/Scalability effort and Pull Requests
Date Fri, 02 Dec 2016 13:48:17 GMT
Hi fellow Flink enthusiasts!

Over the past weeks, several community members have been working hard on a
thread of issues to make Flink more stable and scalable for some demanding
large-scale deployments.

It was quite a lot of work digging through the masses of logs, setting up
clusters to reproduce issues, debugging, and so on. Because of that
intensive time involvement, some of us spent considerably less time on Pull
Requests - as a result we have quite a backlog of pull requests.

We are nearing the end of this effort, and I believe the pull requests can
expect to get more attention again in the near future.


The *good news* is that this effort really pushed Flink forward. We
addressed many issues, and Flink now runs a lot smoother in various
setups.

Here is a sample of issues we addressed as part of that work:

  - The Network Stack had quite a serious issue with (un)fairness of stream
transport under backpressure.

  - Checkpoints are more robust now - they decline better when they cannot
complete, and a limit can be set on how much data may be buffered/spilled
during alignment.

  - Cleanup of old checkpoint state happens more reliably and scalably.

  - Robustness of HA recovery is increased in the presence of ZooKeeper
problems and inconsistent state in ZooKeeper.

  - Improving the cancellation behavior of checkpoints and connectors.
Adding a safeguard against tasks and libraries that block the TaskManagers
during cancellation.

  - Diagnosing a RocksDB bug and updating to a fixed version.

  - Deployment scalability: Getting rid of some RPC messages, handling a
large RPC volume on deployment better

  - Fixing memory leaks in the JobManager in the presence of many restarts.

  - Fixing some bugs in the life cycle of tasks and the execution graph, to
prevent issues with reliable fault tolerance.


I think we'll all enjoy the results of this effort, especially those who
have to deploy and maintain large setups.


Greetings,
Stephan
