httpd-dev mailing list archives

From apache-...@dslr.net
Subject httpd-2.1.3-beta under a large DDOS attack ... not good.
Date Mon, 14 Mar 2005 21:21:04 GMT

*long but interesting, I hope*

I had the displeasure of coping with a large DDOS attack this weekend
and tested out how apache 2.1.3-beta did. It didn't do very well at all.

I realize this list is for "discussion of changes to the source code and
related issues" but I'm hoping this is still appropriate and would be
interested to get feedback.

The attack was from a botnet comprising, at any one moment, over 6000
unique IPs. New IPs were adding themselves fairly constantly, a dozen
every few minutes at least. They were also rolling off. The clients were
all Windows XP and Windows 2000 machines, and judging from the ping times
to many of them, many were on dialup or other dynamic IPs. They were getting
their current attack targets from a php program on the webserver of a
rooted box, not from an IRC type control system.

The zombie army had a rather unusual attack. They would send multiple SYN
packets to try to open a connection on the web server; syn_cookies coped
OK with this. If they succeeded in opening a connection to port 80, they
would then send one random character at a time, about one second apart.
Each character would come in its own TCP packet, with the PSH flag set.
Even an infinitely powerful web server would therefore soon be handling
10,000 to 50,000 active connections, all doing nothing much. Why more
than the number of unique IPs? Because the zombies were also doing this
in parallel, so one zombie could hold open more than one connection. At
the same time, they were flooding our IP with fragmented ping packets of
1480 bytes each in order to choke our port. About 9/10ths of the traffic
by volume was fragmented pings, and about 9/10ths of the traffic by
packet count was SYNs or 1- or 2-byte data packets, along with associated
RSTs and so on.

The hardware on the machine coped OK with all this - about 60 Mbit/s of
incoming traffic, on Linux 2.4 with the latest NAPI-enabled e1000.o - but
one CPU was pretty much flat out picking up packets from the card.
Unfortunately, if more than 3000 IPs were added to netfilter as DROPs,
the box would start to fall behind (overruns on the card). So
blacklisting all bad IPs was a non-starter. Even picking out bad IPs
wasn't easy, as they look initially like a normal open request.

OK, so how did the latest Apache source cope with this, using the
worker MPM? Not too well. I tried a number of different approaches. The
first thing I did was reduce the Timeout to 1 second. In this way I
hoped to fast-drop any connections that were dribbling characters.
Unfortunately, with zombies sending one character per second, Apache
did not drop the connections fast enough: a zombie could keep a slot
open for 5-10 seconds before it got kicked.
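A minimal sketch of why an idle timeout alone struggles here, assuming the timeout is re-armed after every byte received. The function name, the `total_deadline` parameter and the gap values are hypothetical and only illustrate the idea; this is not Apache's actual code path.

```python
def seconds_until_dropped(gaps, idle_timeout, total_deadline=None):
    """Return when a server would drop a dribbling client, or None if never.

    gaps           -- seconds of silence before each byte the client sends
    idle_timeout   -- longest silence tolerated between bytes (like Timeout)
    total_deadline -- optional cap on the whole request-read phase
                      (hypothetical; not a directive mentioned in this post)
    """
    elapsed = 0.0
    for gap in gaps:
        # The server waits at most `limit` seconds for the next byte.
        limit = idle_timeout
        if total_deadline is not None:
            limit = min(limit, total_deadline - elapsed)
        if gap >= limit:
            return elapsed + limit  # timer fired before the byte arrived
        elapsed += gap              # byte arrived in time, idle timer re-arms
    return None                     # client finished without being dropped


# A client sending a byte every 0.75 s, with one 1.25 s hiccup after six bytes.
dribble = [0.75] * 6 + [1.25]
print(seconds_until_dropped(dribble, idle_timeout=1.0))
print(seconds_until_dropped(dribble, idle_timeout=1.0, total_deadline=2.0))
```

With only the idle timeout, the client above holds its slot until its first gap longer than one second. With a cap on the whole read phase, it is dropped at the cap no matter how steadily it dribbles.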

Still, with Apache configured for 6000 max clients, spread over 60 httpds
with 100 threads each, server-status would soon show 4000-5000 active
connections, and the server could still serve legit requests (hardly
difficult - I was serving 302 redirects with mod_rewrite to an
unmolested IP address in order to move users).

HOWEVER, even though server-status showed the config was stable like
this - as many new connections coming in as old ones dying off - and
although the server was functional, the memory on the box was being
consumed at a crazy rate. Within 40 seconds, over 1 GB of physical
memory had vanished, all sucked down by Apache, and unless all processes
were immediately killed, the box would move into swap space and become
totally unresponsive. (The box was an SMP Xeon with 2 GB of memory.) So
I had to kill and restart Apache every 50 seconds.

It also did not matter what I did with MaxRequestsPerChild; whether it
was 1 or 50, I was not getting memory back.

Lastly, no logging of 408 (timeout) errors was happening. I could
telnet to Apache, sit and wait, get kicked after 1 second, and get
no 408 log line.

Apart from that issue, apache was crashing, especially if I tried to
config for more than 5000 clients. I would either receive out of memory
errors when doing thread creates, or other bad error messages relating
to one child, that would cause the entire server to shut down.

Here is an example of the scoreboard: notice, the server has been up for
just 20 seconds:

   Current Time: Sunday, 13-Mar-2005 14:54:39 EST
   Restart Time: Sunday, 13-Mar-2005 14:54:18 EST
   Parent Server Generation: 0
   Server uptime: 20 seconds
   3598 requests currently being processed, 987 idle workers

RR..C..CRRRRRRRRCRRRRRRRCRRR.RRRRRC.R.R..RRRRRRRCCCR..CRR.R.CRRR
RRRRCRR.RR.CRC.CCRCC..CCR..CC.CC.C....CCC..CC....R.C.R.RR.R.CC.C
CC..CC..CC.CR.C.R.CCCC...RR.R.CC.CCRC......C.CR......C...CC..C.C
..C.CCCRRCRCCRRRCRRRRRCRRRRRRRCRRRRCRCRRRRRRRRCCRRCRRRRCRRRRRRRR
RCCCCRCRRRRCRRRRRCRRRRCRRRRRRRRRRRRCRCRRRRRRRRCRRRCCRRCCRRCRCRCR
RRRRRCCCRRR_RRRRRRCRCC_CRCCRCRRRRRRCCCCRCRCRRRRCRRCRCRCRRCCRCCRC

Here is an example of the memory usage:

             total       used       free     shared    buffers     cached
Mem:       2063892    1301084     762808          0      24640     404512
-/+ buffers/cache:     871932    1191960
Swap:      2040244      84392    1955852
Sun Mar 13 14:02:31 EST 2005
             total       used       free     shared    buffers     cached
Mem:       2063892    1340288     723604          0      24640     404524
-/+ buffers/cache:     911124    1152768
Swap:      2040244      84392    1955852
Sun Mar 13 14:02:32 EST 2005
             total       used       free     shared    buffers     cached
Mem:       2063892    1398268     665624          0      24640     404524
-/+ buffers/cache:     969104    1094788
Swap:      2040244      84392    1955852
Sun Mar 13 14:02:32 EST 2005

Notice total/used/free and how quickly free is dropping even though the
number of in-play clients was hovering around the same figure
(3000-4000) the whole time. This was NOT increasing memory due to
increasing load.
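To put numbers on the samples above, here is a small sketch that pulls the "used" column out of the three Mem: lines and prints the growth between samples. The parsing assumes the column layout shown in the output above; the helper name is made up for illustration.

```python
# The three "Mem:" lines from the free output above (values in kB).
FREE_OUTPUT = """\
Mem:       2063892    1301084     762808          0      24640     404512
Mem:       2063892    1340288     723604          0      24640     404524
Mem:       2063892    1398268     665624          0      24640     404524
"""


def used_kb(free_text):
    """Return the 'used' column (third field) of every Mem: line."""
    return [
        int(line.split()[2])
        for line in free_text.splitlines()
        if line.startswith("Mem:")
    ]


samples = used_kb(FREE_OUTPUT)
deltas = [later - earlier for earlier, later in zip(samples, samples[1:])]
total_kb = samples[-1] - samples[0]

print("growth between samples (kB):", deltas)
print("total growth: %d kB (about %.1f MB)" % (total_kb, total_kb / 1024))
```

That is 39204 kB and then 57980 kB between samples, 97184 kB in total, across timestamps that are at most a second or two apart. The cached column barely moves (404512 to 404524), so the growth is not page cache.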

Here is one worker.c config that didn't go well; it triggered the memory
problems in apr_thread_create etc:

<IfModule worker.c>
ServerLimit         200
MaxClients          6500
# per child..
ThreadLimit         250
ThreadsPerChild     250
#
StartServers        5000
MinSpareThreads     500
MaxSpareThreads     1500
# never expires...
MaxRequestsPerChild  50
</IfModule>

Here is the current one, which worked better except for the memory leak
and not being able to handle quite enough simultaneous connections:

<IfModule worker.c>
ServerLimit         200
MaxClients          3456
# per child..
ThreadLimit         64
ThreadsPerChild     64
#
StartServers        2000
MinSpareThreads     400
MaxSpareThreads     1000
# never expires...
MaxRequestsPerChild  50
</IfModule>

Timeout was 1 second and KeepAlive was off.
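As a rough sanity check on the two configs above, the sketch below works out how many child processes each MaxClients / ThreadsPerChild pair needs and how much stack address space the threads of one child would reserve. The 8 MB per-thread stack is an assumption (a common Linux `ulimit -s` default); the actual stack size on this box is not stated, so treat the stack figures as illustrative only. The function name is hypothetical.

```python
import math


def worker_plan(max_clients, threads_per_child, server_limit, stack_mb):
    """Back-of-the-envelope numbers for a worker MPM config.

    stack_mb is the assumed per-thread stack reservation in MB; it is an
    input to this sketch, not something read from Apache or the kernel.
    """
    children = math.ceil(max_clients / threads_per_child)
    return {
        "children_needed": children,
        "fits_server_limit": children <= server_limit,
        "stack_mb_per_child": threads_per_child * stack_mb,
    }


ASSUMED_STACK_MB = 8  # assumption, see the note above

# First config: MaxClients 6500, ThreadsPerChild 250, ServerLimit 200
print(worker_plan(6500, 250, 200, ASSUMED_STACK_MB))
# Second config: MaxClients 3456, ThreadsPerChild 64, ServerLimit 200
print(worker_plan(3456, 64, 200, ASSUMED_STACK_MB))
```

Under that assumption, the first config asks each child to reserve about 2000 MB of address space for thread stacks against about 512 MB for the second. If the processes are 32-bit, that could be one reason thread creation failed with "Cannot allocate memory" while physical memory was still free, but that is a guess rather than a diagnosis.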

Here is a strace of a thread dealing with a zombie:

poll([{fd=52, events=POLLIN, revents=POLLIN}], 1, 1000) = 1
read(52, "i", 8000)                     = 1
brk(0)                                  = 0x952a000
brk(0x952d000)                          = 0x952d000
poll([{fd=52, events=POLLIN, revents=POLLIN}], 1, 1000) = 1
read(52, "T", 8000)                     = 1
brk(0)                                  = 0x95a1000
brk(0x95a3000)                          = 0x95a3000
poll([{fd=52, events=POLLIN, revents=POLLIN}], 1, 1000) = 1
read(52, "K", 8000)                     = 1
brk(0)                                  = 0x9611000
brk(0x9613000)                          = 0x9613000
poll([{fd=52, events=POLLIN, revents=POLLIN}], 1, 1000) = 1
read(52, "h", 8000)                     = 1
brk(0)                                  = 0x967d000
brk(0x967f000)                          = 0x967f000
poll([{fd=52, events=POLLIN, revents=POLLIN}], 1, 1000) = 1
read(52, "h", 8000)                     = 1
brk(0)                                  = 0x9701000
brk(0x9703000)                          = 0x9703000
poll([{fd=52, events=POLLIN, revents=POLLIN}], 1, 1000) = 1

As you can see - one character per second, give or take.
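The brk() calls in that trace can also be read as a heap growth measurement. The sketch below parses only the brk lines copied from the strace above and separates the growth this thread asked for from the total movement of the program break. The regular expression and helper name are illustrative.

```python
import re

# brk lines copied from the strace excerpt above.
STRACE_BRK = """\
brk(0)                                  = 0x952a000
brk(0x952d000)                          = 0x952d000
brk(0)                                  = 0x95a1000
brk(0x95a3000)                          = 0x95a3000
brk(0)                                  = 0x9611000
brk(0x9613000)                          = 0x9613000
brk(0)                                  = 0x967d000
brk(0x967f000)                          = 0x967f000
brk(0)                                  = 0x9701000
brk(0x9703000)                          = 0x9703000
"""

BRK_RE = re.compile(r"brk\((0x[0-9a-f]+|0)\)\s*=\s*(0x[0-9a-f]+)")


def brk_growth(trace):
    """Return (total break movement, growth requested by this thread) in bytes."""
    calls = [(int(arg, 16), int(ret, 16)) for arg, ret in BRK_RE.findall(trace)]
    total = calls[-1][1] - calls[0][1]
    # Each brk(addr) follows a brk(0) query; the difference between the two
    # results is what this thread itself added to the heap.
    own = sum(
        cur[1] - prev[1]
        for prev, cur in zip(calls, calls[1:])
        if cur[0] != 0
    )
    return total, own


total, own = brk_growth(STRACE_BRK)
print("break moved by %d bytes, %d of them requested here" % (total, own))
```

Over roughly four seconds the break moves by 1937408 bytes (about 1.8 MB), but this thread only asked for 45056 of them, 8-12 kB per character read. The rest of the movement happens between this thread's calls, presumably from other threads in the same process extending the shared heap.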

Pretty much whatever I did when shooting for more than 5000 clients, I
would sooner or later get scattered errors like:

[Sun Mar 13 13:23:58 2005] [alert] (12)Cannot allocate memory: apr_thread_create: unable to create worker thread
[Sun Mar 13 13:24:26 2005] [alert] (12)Cannot allocate memory: apr_thread_create: unable to create worker thread

Or I also saw:

[Sun Mar 13 17:47:45 2005] [alert] (12)Cannot allocate memory: apr_thread_create: unable to create worker thread
httpd: misc/apr_reslist.c:156: reslist_cleanup: Assertion `rl->ntotal == 0' failed.

or

[Sat Mar 12 19:28:05 2005] [notice] seg fault or similar nasty error detected in the parent process

Usually the memory issues would then trigger:

[Sun Mar 13 13:35:59 2005] [alert] Child 26399 returned a Fatal error...\nApache is exiting!
[Sun Mar 13 13:35:59 2005] [warn] child process 32652 still did not exit, sending a SIGTERM

I'd already done:

        ulimit -Su 8192
        echo 16384 > /proc/sys/kernel/threads-max

in the apache startup.

This very same config and server, which worked so badly under this DDOS,
copes much better with normal site traffic (no memory leak). But I still
get this (using a grep, for today only):

[Mon Mar 14 11:58:18 2005] [notice] child pid 10146 exit signal Segmentation fault (11)
[Mon Mar 14 11:58:22 2005] [notice] child pid 10150 exit signal Segmentation fault (11)
[Mon Mar 14 11:59:18 2005] [notice] child pid 13677 exit signal Segmentation fault (11)
[Mon Mar 14 13:30:25 2005] [notice] child pid 13200 exit signal Segmentation fault (11)
[Mon Mar 14 13:41:16 2005] [notice] child pid 15684 exit signal Segmentation fault (11)
[Mon Mar 14 13:42:43 2005] [notice] child pid 2271 exit signal Segmentation fault (11)
[Mon Mar 14 14:38:54 2005] [notice] child pid 19146 exit signal Segmentation fault (11)
[Mon Mar 14 15:35:36 2005] [notice] child pid 29345 exit signal Segmentation fault (11)
[Mon Mar 14 15:48:43 2005] [notice] child pid 10987 exit signal Segmentation fault (11)
[Mon Mar 14 16:00:03 2005] [notice] child pid 18190 exit signal Segmentation fault (11)


Note, I tried httpd-2.0.48, 2.0.53 and 2.1.3-beta. None of them coped
any differently with this DDOS.

My wishlist - is this unreasonable?

* Apache2 can handle 16000 active open connections on a reasonably sized
box, at least if they are all bogus and going to be rejected, without
recompilation of glibc.

* 408 errors are reported again (for identifying bad players). Maybe
this is my issue but I don't seem to get them.

* More than one kind of timeout can be set. For example, I would have
liked to have set a request phase timeout of 0.5 seconds or a total
request phase timeout of 2 seconds (not an idle timeout of 1 second).

* A flag for rejecting slow writers or peculiar ones (malformed garbage
gets you kicked sooner).

* Graceful non-crashing behavior when thread resources of one kind or
another are exceeded.

* Some kind of flag that turns on rejection of the oldest active and
pending request if there is no more room left and a new request is
received.
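To make the last wishlist item concrete, here is a minimal sketch of "reject the oldest when full", using an ordered map of connections keyed by a hypothetical connection id. It shows the bookkeeping only and says nothing about how Apache would implement it; the class and method names are invented for illustration.

```python
from collections import OrderedDict


class ConnectionTable:
    """Fixed-size table that evicts the oldest connection to admit a new one."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._conns = OrderedDict()  # conn_id -> accept time, oldest first

    def admit(self, conn_id, now):
        """Register a new connection; return the evicted id, if any."""
        evicted = None
        if len(self._conns) >= self.capacity:
            # No room left: drop the connection that has been open longest.
            evicted, _ = self._conns.popitem(last=False)
        self._conns[conn_id] = now
        return evicted

    def finish(self, conn_id):
        """Remove a connection that completed normally."""
        self._conns.pop(conn_id, None)


table = ConnectionTable(capacity=2)
table.admit("a", now=0.0)
table.admit("b", now=0.5)
print(table.admit("c", now=1.0))  # table is full, so the oldest one goes
```

Under the attack described above, the oldest slots would mostly be the dribbling zombies, since a legitimate request normally completes well before it becomes the oldest entry. That is the appeal of this policy, though a slow legitimate client would get evicted too.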

