cassandra-commits mailing list archives

From "Jason Brown (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-13126) native transport protocol corruption when using SSL
Date Wed, 15 Feb 2017 14:50:41 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-13126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15867950#comment-15867950 ]

Jason Brown commented on CASSANDRA-13126:
-----------------------------------------

I agree with [~tvdw]'s assessment here: once part of the data stream is lost, you're pretty
much screwed. There may be some route to recovery, but it would probably depend on some
degree of luck and fortuitous timing. I think the simplest solution is to close the channel/socket,
as I suspect error recovery code would be tricky and could itself open security holes (I
am not a security expert, so I may be wrong).

bq. Wouldn't frequently reconnecting clients possibly cause more memory pressure in this case
and further escalate the issue?

Quite possibly. {{ConnectionLimitHandler}} might be able to help, but even that will
have some costs before it executes in a channel.

> native transport protocol corruption when using SSL
> ---------------------------------------------------
>
>                 Key: CASSANDRA-13126
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13126
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Tom van der Woerdt
>            Priority: Critical
>
> This is a series of conditions that can result in client connections becoming unusable.
> 1) Cassandra GC must be well-tuned, to have short GC pauses every minute or so
> 2) *client* SSL must be enabled and transmitting a significant amount of data
> 3) Cassandra must run with the default library versions
> 4) disableexplicitgc must be set (this is the default in the current cassandra-env.sh)
> This ticket relates to CASSANDRA-13114, which is a possible workaround (but not a fix) for the SSL requirement to trigger this bug.
> * Netty allocates nio.ByteBuffers for every outgoing SSL message.
> * ByteBuffers consist of two parts, the JVM object and the off-heap object. The JVM object is small and goes with regular GC cycles; the off-heap object gets freed only when the small JVM object is freed. To avoid exploding the native memory use, the JVM defaults to limiting its off-heap allocation to the max heap size. Allocating beyond that limit triggers a System.gc(), a retry, and potentially an exception.
> * System.gc() is a no-op under DisableExplicitGC
> * This means ByteBuffer allocation is likely to throw an exception when too many buffers are being allocated
> * The netty version shipped in Cassandra is broken when using SSL (see CASSANDRA-13114) and causes significantly too many ByteBuffers to be allocated.
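
The allocation path described in the bullets above can be illustrated with a toy model. This is a sketch only: `DirectMemoryModel` and all of its methods are hypothetical names, nothing here touches real direct memory, and the real JDK logic is more involved than a single retry. The `explicitGcDisabled` flag stands in for running with DisableExplicitGC.

```java
/** Toy model of direct-memory accounting; hypothetical, not the real java.nio code. */
final class DirectMemoryModel {
    private final long limit;                 // stands in for the direct memory cap
    private final boolean explicitGcDisabled; // stands in for DisableExplicitGC
    private long reserved;                    // bytes currently counted against the cap
    private long garbage;                     // bytes owned by dead, not yet collected, heap objects

    DirectMemoryModel(long limit, boolean explicitGcDisabled) {
        this.limit = limit;
        this.explicitGcDisabled = explicitGcDisabled;
    }

    /** Reserve, and on failure ask for a GC and retry once before giving up. */
    void allocate(long bytes) {
        if (tryReserve(bytes)) return;
        systemGc();
        if (tryReserve(bytes)) return;
        throw new OutOfMemoryError("Direct buffer memory");
    }

    /** The small heap object became unreachable; its off-heap part is still reserved. */
    void drop(long bytes) {
        garbage += bytes;
    }

    /** A regular GC cycle frees the off-heap memory of dead buffers. */
    void regularGc() {
        reserved -= garbage;
        garbage = 0;
    }

    /** Explicit GC request: does nothing when explicit GC is disabled. */
    private void systemGc() {
        if (!explicitGcDisabled) regularGc();
    }

    private boolean tryReserve(long bytes) {
        if (reserved + bytes > limit) return false;
        reserved += bytes;
        return true;
    }

    long reserved() {
        return reserved;
    }
}
```

With explicit GC enabled, a second `allocate(80)` against a limit of 100 succeeds after the first buffer is dropped, because the explicit GC reclaims it. With explicit GC disabled, the same sequence throws until a regular GC cycle happens to run, which matches the condition that GC pauses be infrequent.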
> This gets more complicated though.
> When /some/ clients use SSL, and others don't, the clients not using SSL can still be affected by this bug, as ByteBuffer starvation caused by SSL will leak to other users.
> ByteBuffers are used very early on in the native protocol as well. Before even being able to decode the network protocol, this error can be thrown:
> {noformat}
> io.netty.handler.codec.DecoderException: java.lang.OutOfMemoryError: Direct buffer memory
> {noformat}
> Note that this comes back with stream_id 0 rather than the stream id of the failed request, so clients cannot match it to a pending query and end up waiting for the client timeout before the query is considered failed and retried.
> A few frames later on the same connection, this appears:
> {noformat}
> Provided frame does not appear to be Snappy compressed
> {noformat}
> And after that everything errors out with:
> {noformat}
> Invalid or unsupported protocol version (54); the lowest supported version is 3 and the greatest is 4
> {noformat}
> So this bug ultimately affects the binary protocol and the connection becomes useless, if not downright dangerous.
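
One way to see how a lost frame can produce a nonsensical version such as 54: the v3/v4 native protocol frame header is 9 bytes (version, flags, 2-byte stream id, opcode, 4-byte body length), and a decoder that has lost its place will read whatever byte comes next as the version. The sketch below is illustrative only: `FrameHeaderDemo` and its methods are hypothetical, and the body content is made up. Which byte the server actually misread in this ticket is not known.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo of reading a frame header at the wrong stream offset. */
final class FrameHeaderDemo {
    static final int HEADER_LENGTH = 9; // version, flags, stream (2), opcode, length (4)

    /** Builds a request frame: header followed by the raw body bytes. */
    static byte[] frame(int version, int streamId, int opcode, byte[] body) {
        ByteBuffer buf = ByteBuffer.allocate(HEADER_LENGTH + body.length);
        buf.put((byte) version);        // high bit clear: request direction
        buf.put((byte) 0);              // flags
        buf.putShort((short) streamId); // stream id
        buf.put((byte) opcode);         // opcode
        buf.putInt(body.length);        // body length
        buf.put(body);
        return buf.array();
    }

    /** Interprets the byte at the given offset as a version byte (low 7 bits). */
    static int versionAt(byte[] stream, int offset) {
        return stream[offset] & 0x7F;
    }

    public static void main(String[] args) {
        // Made-up body that happens to start with the ASCII character '6' (0x36).
        byte[] body = "6 rows".getBytes(StandardCharsets.US_ASCII);
        byte[] stream = frame(4, 1, 0x07, body);

        System.out.println(versionAt(stream, 0));             // aligned read
        System.out.println(versionAt(stream, HEADER_LENGTH)); // misaligned read, lands in the body
    }
}
```

Read at offset 0, the version is 4. Read at the start of the body, the "version" is 54, the ASCII code for `6`. Any payload byte can end up in that position once the decoder is out of step, which is why every later frame on the connection fails.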
> I think there are several things that need to be done here.
> * CASSANDRA-13114 should be fixed (easy, and probably needs to land in 3.0.11 anyway)
> * Connections should be closed after a DecoderException
> * DisableExplicitGC should be removed from the default JVM arguments
> Any of these three would limit the impact on clients.
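
The second suggestion, closing the connection after a decode failure, amounts to a fail-closed policy: once one header is invalid, nothing later on that stream is trusted. The sketch below models that policy in plain Java. `FailClosedDecoder` and its methods are hypothetical and do not reflect Cassandra's or Netty's actual classes; in Cassandra this would presumably live in the Netty pipeline and close the channel, which is not shown here.

```java
import java.nio.ByteBuffer;

/** Hypothetical fail-closed frame decoder: the first invalid header ends the connection. */
final class FailClosedDecoder {
    static final int HEADER_LENGTH = 9;
    static final int MIN_VERSION = 3;
    static final int MAX_VERSION = 4;

    private boolean closed;
    private int framesDecoded;

    /** Helper for building a request frame with an arbitrary body. */
    static ByteBuffer frame(int version, byte[] body) {
        ByteBuffer buf = ByteBuffer.allocate(HEADER_LENGTH + body.length);
        buf.put((byte) version).put((byte) 0).putShort((short) 1).put((byte) 0x07);
        buf.putInt(body.length).put(body);
        buf.flip();
        return buf;
    }

    /** Consumes whole frames from the buffer; stops for good at the first bad header. */
    void feed(ByteBuffer in) {
        while (!closed && in.remaining() >= HEADER_LENGTH) {
            in.mark();
            int version = in.get() & 0x7F;
            in.get();       // flags
            in.getShort();  // stream id
            in.get();       // opcode
            int length = in.getInt();
            if (version < MIN_VERSION || version > MAX_VERSION || length < 0) {
                closed = true; // do not try to resynchronize
                return;
            }
            if (in.remaining() < length) {
                in.reset(); // incomplete body: wait for more bytes
                return;
            }
            in.position(in.position() + length); // skip the body
            framesDecoded++;
        }
    }

    boolean isClosed() {
        return closed;
    }

    int framesDecoded() {
        return framesDecoded;
    }
}
```

After a bad header the decoder ignores all further input, including frames that would otherwise look valid. The client then sees a dropped connection and can reconnect, instead of waiting on timeouts against a stream that can no longer be decoded.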



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
