cassandra-commits mailing list archives

From "Joseph Lynch (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-14358) OutboundTcpConnection can hang for many minutes when nodes restart
Date Wed, 04 Apr 2018 00:50:00 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-14358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16424802#comment-16424802
] 

Joseph Lynch edited comment on CASSANDRA-14358 at 4/4/18 12:49 AM:
-------------------------------------------------------------------

For what it's worth, we've opened an AWS ticket to find out whether this is expected behaviour
in VPCs (i.e. that the network blackholes packets on unknown flows rather than resetting them).
I suspect the answer will be "yeah, that's how stateful firewalls work, we can't keep flows
forever", in which case Cassandra should probably be resilient to it.

[~djoshi3] I think you've got it now. I'm proposing that we tune Cassandra's socket options
(opt-in, of course, like most socket options we set) so that TCP's parameters match Cassandra's
high-availability requirements. For blocking sockets, the mechanisms I'm familiar with are
connect timeouts ([unix|http://man7.org/linux/man-pages/man2/connect.2.html], [java|https://docs.oracle.com/javase/8/docs/api/java/net/Socket.html#connect-java.net.SocketAddress-int-]),
receive timeouts ([SO_RCVTIMEO|https://linux.die.net/man/7/socket], [java|https://docs.oracle.com/javase/8/docs/api/java/net/Socket.html#setSoTimeout-int-])
and send timeouts, which Java doesn't expose afaict ([SO_SNDTIMEO|https://linux.die.net/man/7/socket]).
For {{O_NONBLOCK}} sockets the timeout is typically handled at the event loop level
(e.g. I believe that's what {{WriteTimeoutHandler}} and {{ReadTimeoutHandler}} do in Netty).
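
To make the blocking-socket case concrete, here is a minimal sketch (plain JDK APIs, not
Cassandra code, and the 2s/120s values are purely illustrative) of what setting those knobs
looks like: connect with an explicit timeout, set {{SO_TIMEOUT}} for reads, then layer SSL
over the already-connected socket:
{noformat}
// Sketch only: a blocking socket with bounded connect and read timeouts.
import java.net.InetSocketAddress;
import java.net.Socket;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

public class TimeoutSocketSketch
{
    public static SSLSocket open(String host, int port) throws Exception
    {
        Socket raw = new Socket();
        // connect(SocketAddress, int) bounds the connect() syscall instead of waiting forever
        raw.connect(new InetSocketAddress(host, port), 2000);
        // SO_TIMEOUT bounds each blocking read; the default of 0 means "block forever"
        raw.setSoTimeout(120_000);
        // Layer TLS over the already-connected socket; handshake reads now also time out
        // instead of hanging indefinitely
        SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();
        SSLSocket ssl = (SSLSocket) factory.createSocket(raw, host, port, true);
        ssl.startHandshake();
        return ssl;
    }
}
{noformat}
Send timeouts would still need something else (e.g. the event-loop handlers sketched further
below), since the JDK doesn't give us {{SO_SNDTIMEO}}.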

I still need to test this thoroughly, but I believe asking users to set {{/proc/sys/net/ipv4/tcp_retries2
= 5}} is a reasonable workaround. I was able to easily simulate the exact error seen in the
production logs by doing the following:
{noformat}
$ sudo lsof -p $(pgrep -f Cassandra) -n | grep OTHER_IP | grep STORAGE_PORT
java    5277 cassandra  64u     IPv4             113860      0t0        TCP LOCAL_IP:38665->OTHER_IP:STORAGE_PORT (ESTABLISHED)
java    5277 cassandra  69u     IPv4             114797      0t0        TCP LOCAL_IP:STORAGE_PORT->OTHER_IP:54875 (ESTABLISHED)

$ sudo iptables -A OUTPUT -p tcp -d OTHER_IP --sport 38665 --dport STORAGE_PORT -j DROP

$ cqlsh
cqlsh> CONSISTENCY ALL
Consistency level set to ALL.
cqlsh> select * from test_ks.test_cf WHERE key = 'key_that_lives_on_OTHER_IP';
OperationTimedOut: errors={}, last_host=<something>
... timeouts happen for next ~15 minutes

$ tail -20 system.log
TRACE [MessagingService-Outgoing-/OTHER_IP] 2018-04-04 00:13:05,009 OutboundTcpConnection.java:365 - Socket to /OTHER_IP closed
DEBUG [MessagingService-Outgoing-/OTHER_IP] 2018-04-04 00:13:05,010 OutboundTcpConnection.java:303 - error writing to /OTHER_IP
java.net.SocketException: Connection timed out (Write failed)
    at java.net.SocketOutputStream.socketWrite0(Native Method) ~[na:1.8.0_152]
    at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111) ~[na:1.8.0_152]
    at java.net.SocketOutputStream.write(SocketOutputStream.java:155) ~[na:1.8.0_152]
    at sun.security.ssl.OutputRecord.writeBuffer(OutputRecord.java:431) ~[na:1.8.0_152]
    at sun.security.ssl.OutputRecord.write(OutputRecord.java:417) ~[na:1.8.0_152]
    at sun.security.ssl.SSLSocketImpl.writeRecordInternal(SSLSocketImpl.java:886) ~[na:1.8.0_152]
    at sun.security.ssl.SSLSocketImpl.writeRecord(SSLSocketImpl.java:857) ~[na:1.8.0_152]
    at sun.security.ssl.AppOutputStream.write(AppOutputStream.java:123) ~[na:1.8.0_152]
    at net.jpountz.lz4.LZ4BlockOutputStream.flushBufferedData(LZ4BlockOutputStream.java:205) ~[lz4-1.2.0.jar:na]
    at net.jpountz.lz4.LZ4BlockOutputStream.flush(LZ4BlockOutputStream.java:222) ~[lz4-1.2.0.jar:na]
    at org.apache.cassandra.io.util.DataOutputStreamPlus.flush(DataOutputStreamPlus.java:55) ~[cassandra-2.1.19.jar:2.1.19]
    at org.apache.cassandra.net.OutboundTcpConnection.writeConnected(OutboundTcpConnection.java:294) [cassandra-2.1.19.jar:2.1.19]
    at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:222) [cassandra-2.1.19.jar:2.1.19]
DEBUG [MessagingService-Outgoing-/OTHER_IP] 2018-04-04 00:13:49,867 OutboundTcpConnection.java:380 - attempting to connect to /OTHER_IP
INFO  [HANDSHAKE-/OTHER_IP] 2018-04-04 00:13:49,916 OutboundTcpConnection.java:496 - Handshaking version with /OTHER_IP
TRACE [MessagingService-Outgoing-/OTHER_IP] 2018-04-04 00:13:49,960 OutboundTcpConnection.java:453 - Upgrading OutputStream to be compressed

$ cqlsh
cqlsh> CONSISTENCY ALL
Consistency level set to ALL.
cqlsh> select * from test_ks.test_cf WHERE key = 'key_that_lives_on_OTHER_IP';
... results come back and everything is happy

$ sudo lsof -p $(pgrep -f Cassandra) -n | grep OTHER_IP | grep STORAGE_PORT
java    5277 cassandra  134u     IPv4             113860      0t0        TCP LOCAL_IP:33417->OTHER_IP:STORAGE_PORT (ESTABLISHED)
java    5277 cassandra   69u     IPv4             114797      0t0        TCP LOCAL_IP:STORAGE_PORT->OTHER_IP:54875 (ESTABLISHED)

{noformat}
This simulates "blackholing" of the {{OutboundTcpConnection}}. With the default Linux value
of {{/proc/sys/net/ipv4/tcp_retries2 = 15}} this takes ~15 minutes to resolve (the operating
system eventually saves Cassandra). With the value lowered to ~3, the partition only lasts
~15 seconds. If Cassandra trunk set (configurable) read and write timeouts, we could control
how long it takes to recover from such a partition rather than relying on the OS.
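
For the trunk/Netty side, the event-loop equivalent I have in mind is roughly the stock
{{ReadTimeoutHandler}}/{{WriteTimeoutHandler}} pair; again this is only a sketch with made-up
values, not what trunk currently does:
{noformat}
// Sketch only: bounding internode reads and writes at the event-loop level with
// Netty's stock timeout handlers. The timeout values are illustrative.
import io.netty.channel.ChannelInitializer;
import io.netty.channel.socket.SocketChannel;
import io.netty.handler.timeout.ReadTimeoutHandler;
import io.netty.handler.timeout.WriteTimeoutHandler;
import java.util.concurrent.TimeUnit;

public class TimeoutPipelineSketch extends ChannelInitializer<SocketChannel>
{
    @Override
    protected void initChannel(SocketChannel ch)
    {
        // Fires a ReadTimeoutException if no inbound data is read for 120 seconds
        ch.pipeline().addLast("readTimeout", new ReadTimeoutHandler(120, TimeUnit.SECONDS));
        // Fires a WriteTimeoutException if a write operation does not complete within 10 seconds
        ch.pipeline().addLast("writeTimeout", new WriteTimeoutHandler(10, TimeUnit.SECONDS));
        // ... the actual messaging/handshake handlers would follow here ...
    }
}
{noformat}
Either way the point is the same: the timeout becomes something Cassandra configures rather
than whatever the kernel's retransmission policy happens to be.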


> OutboundTcpConnection can hang for many minutes when nodes restart
> ------------------------------------------------------------------
>
>                 Key: CASSANDRA-14358
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14358
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Streaming and Messaging
>         Environment: Cassandra 2.1.19 (also reproduced on 3.0.15), running with {{internode_encryption:
all}} and the EC2 multi-region snitch on Linux 4.13 within the same AWS region. The smallest
cluster I've seen the problem on is 12 nodes; it reproduces more reliably on 40+ nodes, and
300-node clusters consistently reproduce it on at least one node.
> So all the connections are SSL and we're connecting on the internal IP addresses (not the
public endpoint ones).
> Potentially relevant sysctls:
> {noformat}
> /proc/sys/net/ipv4/tcp_syn_retries = 2
> /proc/sys/net/ipv4/tcp_synack_retries = 5
> /proc/sys/net/ipv4/tcp_keepalive_time = 7200
> /proc/sys/net/ipv4/tcp_keepalive_probes = 9
> /proc/sys/net/ipv4/tcp_keepalive_intvl = 75
> /proc/sys/net/ipv4/tcp_retries2 = 15
> {noformat}
>            Reporter: Joseph Lynch
>            Priority: Major
>         Attachments: 10 Minute Partition.pdf
>
>
> I've been trying to debug nodes not being able to see each other during longer (~5 minute+)
Cassandra restarts in 3.0.x and 2.1.x, which can contribute to {{UnavailableExceptions}} during
rolling restarts of those clusters for us. I think I finally have a lead. It appears that prior
to trunk (with the awesome Netty refactor) we do not set socket connect timeouts on SSL
connections (in 2.1.x, 3.0.x, or 3.11.x), nor, as far as I can tell, do we set {{SO_TIMEOUT}}
on outbound connections. I believe this means that we could potentially block forever on
{{connect}} or {{recv}} syscalls, and on the SSL handshake as well. I think the OS will protect
us somewhat (and that may be what's causing the eventual timeout), but given the right network
conditions our {{OutboundTcpConnection}} threads can just be stuck, never making any progress,
until the OS intervenes.
> I have attached some logs of such a network partition during a rolling restart where an old
node in the cluster has a completely foobarred {{OutboundTcpConnection}} for ~10 minutes before
finally getting a {{java.net.SocketException: Connection timed out (Write failed)}} and
immediately reconnecting successfully. I conclude that the old node is the problem because the
new node (the one that restarted) is sending ECHOs to the old node, and the old node is sending
ECHOs and REQUEST_RESPONSES to the new node's ECHOs, but the new node never receives the ECHOs.
This appears, to me, to indicate that the old node's {{OutboundTcpConnection}} thread is just
stuck and can't make any forward progress. By the time we could notice this and slap TRACE
logging on, the only thing we see is, ~10 minutes later, a {{SocketException}} inside
{{writeConnected}}'s flush and an immediate recovery. It is interesting to me that the
exception happens in {{writeConnected}} and that it's a _connection timeout_ (and since we see
{{Write failed}} I believe this can't be a connection reset), because my understanding is that
we should have a fully handshaked SSL connection at that point in the code.
> Current theory:
>  # "New" node restarts,  "Old" node calls [newSocket|https://github.com/apache/cassandra/blob/6f30677b28dcbf82bcd0a291f3294ddf87dafaac/src/java/org/apache/cassandra/net/OutboundTcpConnection.java#L433]
>  # Old node starts [creating a new|https://github.com/apache/cassandra/blob/6f30677b28dcbf82bcd0a291f3294ddf87dafaac/src/java/org/apache/cassandra/net/OutboundTcpConnectionPool.java#L141] SSL
socket 
>  # SSLSocket calls [createSocket|https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/security/SSLFactory.java#L98],
which conveniently calls connect with a default timeout of "forever". We could hang here forever
until the OS kills us.
>  # If we continue, we get to [writeConnected|https://github.com/apache/cassandra/blob/6f30677b28dcbf82bcd0a291f3294ddf87dafaac/src/java/org/apache/cassandra/net/OutboundTcpConnection.java#L263]
which eventually calls [flush|https://github.com/apache/cassandra/blob/6f30677b28dcbf82bcd0a291f3294ddf87dafaac/src/java/org/apache/cassandra/net/OutboundTcpConnection.java#L341]
on the output stream and also can hang forever. I think the probability is especially high
when a node is restarting and is overwhelmed with SSL handshakes and such.
> I don't fully understand the attached traceback, as it appears we are getting a {{Connection
Timeout}} from a {{send}} failure (my understanding is that you can only get a connection
timeout prior to a send), but I think it's reasonable that we have a timeout configuration
issue. I'd like to try to make Cassandra robust to networking issues like this via maybe:
>  # Change the {{SSLSocket}} {{getSocket}} methods to provide connection timeouts of 2s
(equivalent to trunk's [timeout|https://github.com/apache/cassandra/blob/11496039fb18bb45407246602e31740c56d28157/src/java/org/apache/cassandra/net/async/NettyFactory.java#L329])
>  # Appropriately set recv timeouts via {{SO_TIMEOUT}}, maybe something like 2 minutes
(in old versions via [setSoTimeout|https://docs.oracle.com/javase/8/docs/api/java/net/Socket.html#setSoTimeout-int-],
in trunk via [SO_TIMEOUT|http://netty.io/4.0/api/io/netty/channel/ChannelOption.html#SO_TIMEOUT])
>  # Since we can't set send timeouts afaik (thanks java), maybe we can have some kind of
watchdog that ensures {{OutboundTcpConnection}} is making progress on its queue and, if it
makes no progress for ~30s-1m, forces a disconnect (a rough sketch of that idea follows below).
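> To make #3 concrete, here is a very rough, hypothetical sketch (nothing like this exists in
Cassandra today; the names and thresholds are invented): a scheduled task that watches a
progress counter and forces a disconnect when the queue is backed up but the counter stops moving.
> {noformat}
> // Hypothetical watchdog sketch; names and thresholds are made up for illustration.
> import java.util.concurrent.Executors;
> import java.util.concurrent.ScheduledExecutorService;
> import java.util.concurrent.TimeUnit;
> import java.util.concurrent.atomic.AtomicLong;
> 
> public class ConnectionWatchdogSketch
> {
>     private final AtomicLong messagesSent = new AtomicLong(); // bumped on every successful write
>     private long lastSeenCount = 0;
>     private int stalledChecks = 0;
> 
>     public void start(Runnable forceDisconnect)
>     {
>         ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
>         timer.scheduleAtFixedRate(() -> {
>             long current = messagesSent.get();
>             if (hasPendingMessages() && current == lastSeenCount)
>                 stalledChecks++;              // backlog exists but nothing was sent since last check
>             else
>                 stalledChecks = 0;
>             lastSeenCount = current;
>             if (stalledChecks >= 2)           // ~30s-1m of no progress at a 30 second check period
>             {
>                 stalledChecks = 0;
>                 forceDisconnect.run();        // drop the socket so the connection is rebuilt
>             }
>         }, 30, 30, TimeUnit.SECONDS);
>     }
> 
>     public void markProgress() { messagesSent.incrementAndGet(); }
> 
>     private boolean hasPendingMessages() { return true; } // placeholder for a real backlog check
> }
> {noformat}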
> If anyone has insight or suggestions, I'd be grateful. I am going to rule out whether this
is the keepalive duration by setting tcp_keepalive_probes to something like 1 and maybe
tcp_retries2 to something like 8 to get more information about the state of the TCP connections
the next time this happens. It's a very rare bug, and when it does happen I only have ~10
minutes to jump on the nodes and fix it before it fixes itself, so I'll do my best.


