mesos-issues mailing list archives

From "Alexander Rukletsov (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (MESOS-7748) Slow subscribers of streaming APIs can lead to Mesos OOMing.
Date Tue, 15 Aug 2017 15:01:00 GMT

    [ https://issues.apache.org/jira/browse/MESOS-7748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127095#comment-16127095 ]

Alexander Rukletsov edited comment on MESOS-7748 at 8/15/17 3:00 PM:
---------------------------------------------------------------------

The problem described in this ticket is well studied: [TCP/IP Orphaned Connections Vulnerability|http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2009-1926],
[Slow read DoS/DDoS attack|https://blog.qualys.com/securitylabs/2012/01/05/slow-read], [TCP
receive window closed indefinitely|https://www.kb.cert.org/vuls/id/723308].

There are several things to consider regarding this attack:
* Does the attacker read slowly, or stop reading altogether at some point, e.g., when its TCP
receive buffer fills up?
* Are there multiple attackers from different IP addresses?
* What is the "cost" (memory, CPU) of a stalled connection?

The general recommendation of the IETF [TCP Maintenance and Minor Extensions|http://www.ietf.org/dyn/wg/charter/tcpm-charter.html]
working group is to [selectively abort TCP connections that appear to be malicious under resource
exhaustion conditions|https://www.kb.cert.org/vuls/id/723308]. Detecting misbehaving HTTP
connections is not a trivial task; any solution is a trade-off between improved resiliency
and decreased QoS.

Here are the most popular practical mitigation strategies (in order of increasing complexity):
* Absolute connection timeout, e.g., [Go HTTP library|https://golang.org/pkg/net/#Conn], see
\[[#1]\] for more details.
* Idle connection timeout, write timeout, e.g., [Lighttpd|https://redmine.lighttpd.net/projects/1/wiki/Server_max-write-idleDetails].
[Some sources|https://www.academia.edu/9346526/Analysis_of_Slow_Read_DoS_Attack_and_Countermeasures]
suggest a timeout of at least 10 seconds in order to maintain reasonable QoS.
* Max clients per IP address, e.g., [ModSecurity in Apache|https://github.com/SpiderLabs/ModSecurity/wiki/Reference-Manual#secconnwritest].
* Data transfer rate, e.g., [Barracuda Load Balancers|https://campus.barracuda.com/product/campus/article/display/LBADCv50/17106014/].
* Incremental (adaptive) response timeout, e.g., [Barracuda Load Balancers|https://campus.barracuda.com/product/campus/article/display/LBADCv50/17106014/].

{anchor:1} \[1\] I've experimented a little with the Go HTTP library; see the test binary [here|https://github.com/rukletsov/http-stream-test].
The low-level [connection class|https://golang.org/pkg/net/#Conn] performs [blocking writes|https://golang.org/src/net/net.go?s=6546:6589#L179].
Connection timeouts, called [deadlines|https://golang.org/pkg/net/#Conn], apply to a
connection as a whole, not to a single read or write operation. Idle timeouts can be implemented by
regularly extending deadlines.

The high-level [HTTP server class|https://golang.org/pkg/net/http/#Server] defines write and
read timeouts, which are transformed into deadlines. However, deadlines are refreshed only
when a new request comes in, meaning an indefinite (or sufficiently long) streamed write is interrupted
after the timeout. The suggested solution seems to be to hijack the connection and implement writing
and buffering logic at the application level.


> Slow subscribers of streaming APIs can lead to Mesos OOMing.
> ------------------------------------------------------------
>
>                 Key: MESOS-7748
>                 URL: https://issues.apache.org/jira/browse/MESOS-7748
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Alexander Rukletsov
>            Assignee: Alexander Rukletsov
>            Priority: Critical
>              Labels: mesosphere, reliability
>
> For each active subscriber, Mesos master / slave maintains an event queue, which grows
> over time if the subscriber does not read fast enough. As the number of such "slow" subscribers
> grows, so does Mesos master / slave memory consumption, which might lead to an OOM event.
> Ideas to consider:
> * Restrict the number of subscribers for the streaming APIs
> * Check (ping) for inactive or "slow" subscribers
> * Disconnect the subscriber when there are too many queued events in memory
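
The last idea above (disconnecting a subscriber once too many events are queued) could be sketched as follows. This is an illustration in Go under stated assumptions, not Mesos code: the {{subscriber}} type, the {{enqueue}} and {{broadcast}} functions, and the queue limit are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// errSubscriberTooSlow is returned when a subscriber's queue is full.
var errSubscriberTooSlow = errors.New("subscriber too slow: event queue full")

// subscriber (hypothetical type) owns a bounded event queue.
type subscriber struct {
	events chan string
}

// newSubscriber creates a subscriber that may hold at most maxQueued events.
func newSubscriber(maxQueued int) *subscriber {
	return &subscriber{events: make(chan string, maxQueued)}
}

// enqueue never blocks: it either queues the event or reports that the
// queue is full so the caller can disconnect the subscriber.
func (s *subscriber) enqueue(event string) error {
	select {
	case s.events <- event:
		return nil
	default:
		return errSubscriberTooSlow
	}
}

// broadcast delivers the event to every subscriber and disconnects those
// whose queue is full. It returns the IDs of the dropped subscribers.
func broadcast(subs map[string]*subscriber, event string) []string {
	var dropped []string
	for id, s := range subs {
		if err := s.enqueue(event); err != nil {
			close(s.events)  // signals the writer side to close the connection
			delete(subs, id) // deleting during range is safe in Go
			dropped = append(dropped, id)
		}
	}
	return dropped
}

func main() {
	subs := map[string]*subscriber{
		"fast": newSubscriber(2),
		"slow": newSubscriber(2),
	}
	for i := 0; i < 3; i++ {
		dropped := broadcast(subs, fmt.Sprintf("event %d", i))
		<-subs["fast"].events // the fast subscriber keeps up; the slow one never reads
		fmt.Println("dropped:", dropped)
	}
}
```

With a limit of two queued events, the subscriber that never reads is dropped on the third broadcast, so memory per subscriber stays bounded by the queue limit.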



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
