flink-issues mailing list archives

From "Piotr Nowojski (Jira)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-16030) Add heartbeat between netty server and client to detect long connection alive
Date Thu, 13 Feb 2020 10:20:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-16030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17036108#comment-17036108 ]

Piotr Nowojski commented on FLINK-16030:

Currently, our threading model and network stack cannot reliably support heartbeats on data
network channels (we do have them on Akka). The reason is that we perform blocking
operations inside Netty threads (we recently discussed [this here|http://mail-archives.apache.org/mod_mbox/flink-dev/202002.mbox/browser]).

Unless the keepalive is set to a value like 1 hour, I am afraid that if we add such a feature,
we will get more false-positive connection timeouts, confusing users and causing more new
problems than solving old ones.
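
To illustrate the concern, here is a minimal sketch of how a blocked network thread can turn
into a false-positive timeout. It is a simplified model, not Flink or Netty code: the class
name, method names, and all numbers are hypothetical, and it assumes a single thread handles
both the blocking operation and the heartbeat reply, with network latency ignored.

```java
/** Hypothetical model, not Flink or Netty code: one thread handles both data and heartbeats. */
final class BlockedEventLoopSketch {

    /** Earliest time the single network thread can answer a ping that arrived at pingArrivalMs. */
    static long pongSentAtMs(long pingArrivalMs, long threadBlockedUntilMs) {
        return Math.max(pingArrivalMs, threadBlockedUntilMs);
    }

    /** True if the peer would declare a timeout although the connection itself is healthy. */
    static boolean isFalsePositive(long pingArrivalMs, long threadBlockedUntilMs, long timeoutMs) {
        long replyDelayMs = pongSentAtMs(pingArrivalMs, threadBlockedUntilMs) - pingArrivalMs;
        return replyDelayMs > timeoutMs;
    }

    public static void main(String[] args) {
        long pingArrivalMs = 1_000;
        long threadBlockedUntilMs = 46_000; // assumed: a blocking operation holds the thread for 45 s

        // A 10 s heartbeat timeout fires even though the connection is fine.
        System.out.println(isFalsePositive(pingArrivalMs, threadBlockedUntilMs, 10_000));
        // A 1 hour timeout tolerates the same blocking operation.
        System.out.println(isFalsePositive(pingArrivalMs, threadBlockedUntilMs, 3_600_000));
    }
}
```

In this model the only safe timeout is one longer than the longest blocking operation, which
is why a short heartbeat interval looks risky with the current threading model.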

> Add heartbeat between netty server and client to detect long connection alive
> -----------------------------------------------------------------------------
>                 Key: FLINK-16030
>                 URL: https://issues.apache.org/jira/browse/FLINK-16030
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Network
>            Reporter: begginghard
>            Priority: Major
> Networks can fail in many ways, some of them quite subtle (e.g. a high packet loss ratio).
> When the long-lived TCP connection between the Netty client and server is lost, the server
fails to send its response to the client and then shuts down the channel. Meanwhile, the Netty
client does not know that the connection has been dropped, so it keeps waiting for up to
two hours.
> To detect whether the long-lived TCP connection is still alive on the Netty client and server,
there are two options: TCP keepalive and heartbeats.
> The TCP keepalive time is 2 hours by default. When the long-lived TCP connection dies, the
Netty client waits for 2 hours before it triggers an exception and enters failover recovery.
> For faster detection, Netty provides IdleStateHandler, which can be used to build a ping-pong
mechanism. If the Netty client sends n consecutive ping messages and receives no pong message,
it triggers an exception.
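
The ping-pong rule described above can be sketched as a small counter. This is only an
illustration under stated assumptions: HeartbeatMonitor and its methods are hypothetical
names, not part of Netty or Flink. In a Netty pipeline, onHeartbeatInterval() would
presumably be driven by the IdleStateEvent that IdleStateHandler passes to
userEventTriggered, and onPong() by the inbound handler that reads the pong message.

```java
/** Hypothetical helper, not part of Netty or Flink: counts consecutive unanswered pings. */
final class HeartbeatMonitor {
    private final int maxMissedPongs;
    private int missedPongs;
    private boolean awaitingPong;

    HeartbeatMonitor(int maxMissedPongs) {
        if (maxMissedPongs < 1) {
            throw new IllegalArgumentException("maxMissedPongs must be >= 1");
        }
        this.maxMissedPongs = maxMissedPongs;
    }

    /**
     * Call once per heartbeat interval, just before sending the next ping.
     * Returns true once maxMissedPongs consecutive pings have gone unanswered.
     */
    boolean onHeartbeatInterval() {
        if (awaitingPong) {
            missedPongs++; // the previous ping was never answered
        }
        if (missedPongs >= maxMissedPongs) {
            return true; // caller should fail the channel instead of pinging again
        }
        awaitingPong = true; // caller sends the next ping now
        return false;
    }

    /** Call when a pong arrives; any reply proves the connection is alive. */
    void onPong() {
        awaitingPong = false;
        missedPongs = 0;
    }

    public static void main(String[] args) {
        HeartbeatMonitor monitor = new HeartbeatMonitor(3);
        for (int interval = 1; interval <= 4; interval++) {
            System.out.println("interval " + interval + " dead=" + monitor.onHeartbeatInterval());
        }
    }
}
```

Counting consecutive misses rather than a single one gives some tolerance for a delayed
pong, although it does not remove the risk of false positives when the replying thread is
blocked for longer than n intervals.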
