httpd-users mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alex Harvey <alexharv...@gmail.com>
Subject [users@httpd] intermittent timeouts
Date Mon, 24 Oct 2011 08:21:01 GMT
Dear List,

I have observed a strange problem with a busy Apache webserver that
functions as a reverse proxy serving static content.

I am running RHEL 5u5 2.6.18-238.12.1.el5 and Apache/2.2.3.

During the site's peak load periods we intermittently see response time to a
wget of index.html jump from the average of ~ 0.0074 seconds to anything
from 10 - 1000 seconds.   These "slows" appear during bursts of high load
where around 1100 httpd.

Typical vmstat during the peak load is
Sun Oct 23 18:55:03 EST 2011
procs -----------memory---------- ---swap-- -----io---- --system--
-----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id
wa st
 0  1    256 4277684 108352 860180    0    0     4    10    2    5  1  1 98
0  0
 0  0    256 4257108 108376 860268    0    0     0   226 2157 3683  1  1 97
0  0
 0  0    256 4213320 108392 860360    0    0     0   198 2439 3988  1  1 98
0  0
 0  0    256 4209184 108396 860444    0    0     0   168 2600 4649  1  0 99
0  0
 1  0    256 4207688 108404 860504    0    0     0   180 2092 3647  1  1 98
0  0
 0  0    256 4203120 108404 860596    0    0     0     2 2170 4033  1  0 99
0  0
 0  1    256 4204436 108412 860656    0    0     0    88 2569 5322  1  1 97
0  0
 0  0    256 4207360 108424 860736    0    0     0   278 2216 6223  3  4 94
0  0
 1  0    256 4214052 108436 860796    0    0     0   240 2355 4726  1  1 98
0  0
 0  0    256 4224244 108452 860940    0    0     0   248 2749 5310  2  1 97
0  0

System is certainly not CPU bound during the slows and although I see memory
dip during the slows, it doesn't get to 0 and the system doesn't start
swapping.  (I am not sure how 256K ended up in swap but swapping is rare.)

Now during the intermittent "slows" we can make the following observations:

- Using strace we see read system calls on file descriptor 15, 22 and 25
returning EAGAIN (resource temporarily unavailable).  Sometimes semop()
calls also return EAGAIN.

Example:

[pid  4773] 18:30:23.466468 accept(5, {sa_family=AF_INET,
sin_port=htons(35940), sin_addr=inet_addr("1.1.1.1")}, [17179869200]) = 22
[pid  4773] 18:30:23.466825 semop(878542868, 0x2add48c96e06, 1) = 0
[pid  4773] 18:30:23.467163 fcntl(22, F_GETFL) = 0x2 (flags O_RDWR)
[pid  4773] 18:30:23.467495 fcntl(22, F_SETFL, O_RDWR|O_NONBLOCK) = 0
[pid  4773] 18:30:23.467832 read(22, "GET /somepath/"..., 8000) = 718
[pid  4773] 18:30:23.468323 stat("/apps/www/DocumentRoot/video-icon.gif",
{st_mode=S_IFREG|0644, st_size=1943, ...}) = 0
[pid  4773] 18:30:23.469119 open("/apps/www/DocumentRoot/video-icon.gif",
O_RDONLY) = 25
[pid  4773] 18:30:23.469488 read(22, 0x2add5333d758, 8000) = -1 EAGAIN
(Resource temporarily unavailable)
[pid  4773] 18:30:23.469836 setsockopt(22, SOL_TCP, TCP_CORK, [1], 4) = 0
[pid  4773] 18:30:23.470173 writev(22, [{"HTTP/1.1 200 OK\r\nDate: Thu, 20
O"..., 381}], 1) = 381
[pid  4773] 18:30:23.470514 sendfile(22, 25, [0], 1943) = 1943
[pid  4773] 18:30:23.470919 setsockopt(22, SOL_TCP, TCP_CORK, [0], 4) = 0
[pid  4773] 18:30:23.471448 close(25)   = 0
[pid  4773] 18:30:23.471784 write(20, "1.1.1.1 - - [20/Oct/2011:"..., 327) =
327
[pid  4773] 18:30:23.472288 times( <unfinished ...>
[pid  4773] 18:30:23.472559 <... times resumed> {tms_utime=4, tms_stime=6,
tms_cutime=0, tms_cstime=0}) = 490760208
[pid  4773] 18:30:23.472735 poll([{fd=22, events=POLLIN}], 1, 15000) = 1
([{fd=22, revents=POLLIN}])
[pid  4773] 18:30:23.473294 read(22,  <unfinished ...>
[pid  4773] 18:30:23.473606 <... read resumed> "", 8000) = 0
[pid  4773] 18:30:23.473803 shutdown(22, 1 /* send */ <unfinished ...>
[pid  4773] 18:30:23.474643 <... shutdown resumed> ) = 0
[pid  4773] 18:30:23.475400 poll([{fd=22, events=POLLIN}], 1, 2000) = 1
([{fd=22, revents=POLLIN|POLLHUP}])
[pid  4773] 18:30:23.475742 read(22, "", 512) = 0
[pid  4773] 18:30:23.477026 close(22)   = 0
[pid  4773] 18:30:23.477354 read(16, 0x7fff3fdfdf07, 1) = -1 EAGAIN
(Resource temporarily unavailable)
[pid  4773] 18:30:23.477690 semop(878542868, 0x2add48c96e00, 1) = 0
[pid  4773] 18:30:23.478032 epoll_wait(18, {{EPOLLIN, {u32=1395089824,
u64=47130071228832}}}, 12, 4294967295) = 1

- Using tcpdump on the proxy server we see delays after the initial PSH from
the client is acknowledged:

17:30:46.088527 IP 2.2.2.2.48193 > www.example.com.au.http: S
1252399013:1252399013(0) win 5840 <mss 1460,sackOK,timestamp 129310579
0,nop,wscale 9>
17:30:46.088541 IP www.example.com.au.http > 2.2.2.2.48193: S
2447174601:2447174601(0) ack 1252399014 win 5792 <mss 1460,sackOK,timestamp
103904641 129310579,nop,wscale 9>
17:30:46.089243 IP 2.2.2.2.48193 > www.example.com.au.http: . ack 1 win 12
<nop,nop,timestamp 129310580 103904641>
17:30:46.089563 IP 2.2.2.2.48193 > www.example.com.au.http: P 1:125(124) ack
1 win 12 <nop,nop,timestamp 129310580 103904641>
17:30:46.089579 IP www.example.com.au.http > 2.2.2.2.48193: . ack 125 win 12
<nop,nop,timestamp 103904642 129310580>
# notice there is a 15 second delay before proxy sends the next ACK
17:31:01.376303 IP www.example.com.au.http > 2.2.2.2.48193: . 1:2897(2896)
ack 125 win 12 <nop,nop,timestamp 103919929 129310580>
17:31:01.376313 IP www.example.com.au.http > 2.2.2.2.48193: .
2897:4345(1448) ack 125 win 12 <nop,nop,timestamp 103919929 129310580>
17:31:01.377045 IP 2.2.2.2.48193 > www.example.com.au.http: . ack 1449 win
18 <nop,nop,timestamp 129325867 103919929>
17:31:01.377056 IP www.example.com.au.http > 2.2.2.2.48193: P
4345:7241(2896) ack 125 win 12 <nop,nop,timestamp 103919930 129325867>
17:31:01.377060 IP 2.2.2.2.48193 > www.example.com.au.http: . ack 2897 win
23 <nop,nop,timestamp 129325867 103919929>
17:31:01.377066 IP www.example.com.au.http > 2.2.2.2.48193: P
7241:10137(2896) ack 125 win 12 <nop,nop,timestamp 103919930 129325867>
17:31:01.377174 IP 2.2.2.2.48193 > www.example.com.au.http: . ack 4345 win
29 <nop,nop,timestamp 129325868 103919929>
17:31:01.377185 IP www.example.com.au.http > 2.2.2.2.48193: .
10137:13033(2896) ack 125 win 12 <nop,nop,timestamp 103919930 129325868>
17:31:01.378146 IP 2.2.2.2.48193 > www.example.com.au.http: . ack 5793 win
35 <nop,nop,timestamp 129325868 103919930>
17:31:01.378154 IP 2.2.2.2.48193 > www.example.com.au.http: . ack 7241 win
40 <nop,nop,timestamp 129325868 103919930>
17:31:01.378159 IP www.example.com.au.http > 2.2.2.2.48193: P
13033:15929(2896) ack 125 win 12 <nop,nop,timestamp 103919931 129325868>
17:31:01.378162 IP 2.2.2.2.48193 > www.example.com.au.http: . ack 8689 win
46 <nop,nop,timestamp 129325868 103919930>
17:31:01.378166 IP www.example.com.au.http > 2.2.2.2.48193: .
15929:18825(2896) ack 125 win 12 <nop,nop,timestamp 103919931 129325868>
17:31:01.378169 IP 2.2.2.2.48193 > www.example.com.au.http: . ack 10137 win
51 <nop,nop,timestamp 129325868 103919930>
17:31:01.378172 IP www.example.com.au.http > 2.2.2.2.48193: P
18825:21721(2896) ack 125 win 12 <nop,nop,timestamp 103919931 129325868>
17:31:01.378176 IP 2.2.2.2.48193 > www.example.com.au.http: . ack 11585 win
57 <nop,nop,timestamp 129325868 103919930>
17:31:01.378179 IP www.example.com.au.http > 2.2.2.2.48193: .
21721:24617(2896) ack 125 win 12 <nop,nop,timestamp 103919931 129325868>
17:31:01.378234 IP 2.2.2.2.48193 > www.example.com.au.http: . ack 13033 win
63 <nop,nop,timestamp 129325868 103919930>
17:31:01.378240 IP www.example.com.au.http > 2.2.2.2.48193: P
24617:27513(2896) ack 125 win 12 <nop,nop,timestamp 103919931 129325868>
17:31:01.380720 IP 2.2.2.2.48193 > www.example.com.au.http: . ack 14481 win
68 <nop,nop,timestamp 129325869 103919931>
17:31:01.380745 IP 2.2.2.2.48193 > www.example.com.au.http: . ack 15929 win
74 <nop,nop,timestamp 129325869 103919931>
17:31:01.380750 IP www.example.com.au.http > 2.2.2.2.48193: .
27513:30409(2896) ack 125 win 12 <nop,nop,timestamp 103919933 129325869>
17:31:01.380755 IP 2.2.2.2.48193 > www.example.com.au.http: . ack 17377 win
80 <nop,nop,timestamp 129325869 103919931>
17:31:01.380759 IP www.example.com.au.http > 2.2.2.2.48193: .
30409:33305(2896) ack 125 win 12 <nop,nop,timestamp 103919933 129325869>
17:31:01.380762 IP 2.2.2.2.48193 > www.example.com.au.http: . ack 18825 win
85 <nop,nop,timestamp 129325869 103919931>
17:31:01.380765 IP www.example.com.au.http > 2.2.2.2.48193: P
33305:36201(2896) ack 125 win 12 <nop,nop,timestamp 103919933 129325869>
17:31:01.380769 IP www.example.com.au.http > 2.2.2.2.48193: .
36201:37649(1448) ack 125 win 12 <nop,nop,timestamp 103919933 129325869>
17:31:01.380772 IP www.example.com.au.http > 2.2.2.2.48193: FP
37649:38782(1133) ack 125 win 12 <nop,nop,timestamp 103919933 129325869>
17:31:01.380774 IP 2.2.2.2.48193 > www.example.com.au.http: . ack 20273 win
91 <nop,nop,timestamp 129325869 103919931>
17:31:01.380778 IP 2.2.2.2.48193 > www.example.com.au.http: . ack 21721 win
97 <nop,nop,timestamp 129325870 103919931>
17:31:01.380782 IP 2.2.2.2.48193 > www.example.com.au.http: . ack 23169 win
102 <nop,nop,timestamp 129325870 103919931>
17:31:01.380785 IP 2.2.2.2.48193 > www.example.com.au.http: . ack 24617 win
108 <nop,nop,timestamp 129325870 103919931>
17:31:01.380788 IP 2.2.2.2.48193 > www.example.com.au.http: . ack 26065 win
114 <nop,nop,timestamp 129325870 103919931>
17:31:01.380791 IP 2.2.2.2.48193 > www.example.com.au.http: . ack 27513 win
119 <nop,nop,timestamp 129325870 103919931>
17:31:01.382032 IP 2.2.2.2.48193 > www.example.com.au.http: . ack 28961 win
125 <nop,nop,timestamp 129325872 103919933>
17:31:01.382041 IP 2.2.2.2.48193 > www.example.com.au.http: . ack 30409 win
126 <nop,nop,timestamp 129325872 103919933>
17:31:01.382046 IP 2.2.2.2.48193 > www.example.com.au.http: . ack 31857 win
126 <nop,nop,timestamp 129325872 103919933>
17:31:01.382049 IP 2.2.2.2.48193 > www.example.com.au.http: . ack 33305 win
126 <nop,nop,timestamp 129325872 103919933>
17:31:01.382053 IP 2.2.2.2.48193 > www.example.com.au.http: . ack 36201 win
126 <nop,nop,timestamp 129325872 103919933>
17:31:01.382057 IP 2.2.2.2.48193 > www.example.com.au.http: . ack 38783 win
126 <nop,nop,timestamp 129325872 103919933>
17:31:01.382114 IP 2.2.2.2.48193 > www.example.com.au.http: F 125:125(0) ack
38783 win 126 <nop,nop,timestamp 129325872 103919933>
17:31:01.382118 IP www.example.com.au.http > 2.2.2.2.48193: . ack 126 win 12
<nop,nop,timestamp 103919935 129325872>

- analysis of netstat -s sampled every second has shown that during the
delayed wgets there are correspondingly a large number of delayed ACKs (i.e.
netstat -s |grep 'delayed acks sent') as well as a correlation with "packets
directly received from backlog".

Trying to understand this latter counter I have found the source code
http://lxr.linux.no/#linux+v2.6.30/net/ipv4/tcp.c#L1519 - I note from the
setsockopt() that Apache is using TCP_CORK.

As far as I can see, in netstat -s, the only two counters that correlate
with my wget slows are the two I've mentioned:
delayed ACKs and packets directly received from backlog.

My network tuning is as follows:

net.core.rmem_max = 16777216
net.core.rmem_default = 262144
net.core.wmem_max = 16777216
net.core.wmem_default = 262144
net.core.netdev_max_backlog = 2500
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.conf.default.send_redirects=0
net.ipv4.conf.all.send_redirects=0
net.ipv4.conf.default.secure_redirects=0
net.ipv4.conf.all.secure_redirects=0
net.ipv4.conf.default.accept_redirects=0
net.ipv4.conf.all.accept_redirects=0
net.ipv4.conf.default.rp_filter=1
net.ipv4.conf.all.rp_filter=1
net.ipv4.tcp_max_syn_backlog=2048

Very occasionally we see frames dropped by the NIC - but presumably not
enough to explain this issue.

If anyone has any idea thoughts are most welcome - I have been grappling
with this for over a week now.

Kind regards,
Alex Harvey

Mime
View raw message