Mailing-List: contact users-help@httpd.apache.org; run by ezmlm
Precedence: bulk
Reply-To: users@httpd.apache.org
Received-SPF: neutral (hermes.apache.org: local policy)
Message-ID: <42679D18.2050107@motorweb.co.nz>
Date: Fri, 22 Apr 2005 00:31:20 +1200
From: Damon Maria <damon@motorweb.co.nz>
User-Agent: Mozilla Thunderbird 1.0.2 (Windows/20050317)
MIME-Version: 1.0
To: users@httpd.apache.org
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Subject: [users@httpd] After a couple days httpd child processes go 100% CPU

Sorry for the not very descriptive subject, but this is a hard to 
describe problem...

Some basic details:
- httpd-2.0.54 (although I've had this problem since at least 2.0.49)
- Redhat 9.0 (kernel 2.4.20-8)
- 2.4GHz P4 256M mem.

After some time (usually a couple of days) of running fine and averaging 
about 0.5% CPU usage httpd suddenly goes mad. A particular child process 
will be at close to 100% CPU usage. It will stay like that for a couple 
of seconds before returning to normal CPU usage as another child process 
goes to 100% CPU for a couple seconds. And so on and so on.

A server restart fixes the problem. Once it took 2 restarts. Even a 
graceful restart will do it but results in issues with a global mutex 
lock related to the SSLSessionCache which hasn't been released.

I added in a cronjob to restart the server every night at 5:00am. That 
seems to hold the problem at bay for a while but eventually it happens 
within the 24 hours between restarts (this time it was about 11 hours 
after the cronjob restart). So the length of time it takes to show up 
the problem isn't constant.

This issue has been around for quite a few months now. At first it only 
happened occasionally but now it's now getting more and more frequent - 
I guess tracking with the increase of traffic on our site. I tried 
reverting back to 2.0.49 but that made no difference. I've since 
upgraded with each release but still have the problem. This site has 
been running on apache 2 for a couple of years now.

I've written a script that continually dumps out the result of the 
server-status handler with ExtendedStatus On and runs top in batch mode 
piping it out to a log. Last time the issue happened I ran this script 
so I'd have a record of what CPU usage there was per process 
second-by-second as well as what those processes were doing according to 
httpd.

The output of top shows only one httpd child process is ever running 
100% CPU, and it changes to a different child process every couple of 
seconds. The line from server-status for that process always looks the 
same, something like:

Srv PID Acc M CPU SS Req Conn Child Slot Client VHost Request
68-1 22174 0/1/1 R 0.00 73 578 0.0 0.01 0.01 ? ? ..reading..

As you can see, it has used no CPU and has been sitting there for 73 
seconds handling this request (SS). If I trace this process (68-1 22174) 
through page after page of output from server-status the SS value 
(Seconds since beginning of most recent request) increases but none of 
the other stats change. Strange: the process was listed as close to 100% 
CPU for a couple of seconds by top but this doesn't show up in any of 
the stats from server-status.

As the client, vhost and request are never shown for these runaway 
processes I can't even figure out in which area of our configuration 
these requests are hitting.

By the time I was notified to the problem and got the script running 
there were already ~120 children looking like this and only 7 children 
listed with any request information. From the header of server-status:

RRRRRRRRRRRRRRRWRRRRRRRRRRCRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRKRRRRRRRRRRRRRRRRRRRWWR___RR
RR_____.........................................................
................................................................

Also from the log of top I can see that overall the machine is 0% idle, 
and 0% io wait, but about 85% system and only 15% user. Although I'm not 
sure what (if anything) that means. The memory is close to full but only 
8M of swap is being used, so I don't think that's a problem.

I'd like to run through an aspect of our setup that is not so ordinary, 
as this may give someone a clue as to what's going on. We run SSL on two 
ports: 443 and 444. This is to get around this bug:

http://issues.apache.org/bugzilla/show_bug.cgi?id=12355

Some URLs on 443 require a client certificate, and these always use GET 
rather than POST (because of the above bug). We also proxy another site 
(that is out of our control) which we must secure with client 
certificates. As this site uses POST we have to put this onto a virtual 
host on a different port so we can specify SSLVerifyClient require for 
that entire <virtualhost>, rather than for particular <location>s.

Our (truncated) SSL setup:

SSLSessionCache dbm:/var/log/httpd/ssl_cache
SSLSessionCacheTimeout 600
SSLVerifyDepth 1
SSLCACertificateFile XXXXXXX
SSLVerifyClient none
SSLCipherSuite 
ALL:!ADH:!EXPORT56:RC4+RSA:+HIGH:+MEDIUM:+LOW:+SSLv2:+EXP:+eNULL
SSLCertificateFile XXXXX
SSLCertificateKeyFile XXXXX
SSLCertificateChainFile XXXXX
<VirtualHost *:443>
   ServerName www.motorweb.co.nz
   SSLEngine on
   <LocationMatch "^/YYYY/ZZZZ">
     SSLVerifyClient require
     SSLOptions +OptRenegotiate +StdEnvVars
   </LocationMatch>
   ... several other <location>s like the one above
</VirtualHost>
<VirtualHost *:444>
   ServerName www.motorweb.co.nz
   SSLEngine on
   SSLVerifyClient require
   SSLOptions +OptRenegotiate +StdEnvVars
</VirtualHost>

Side note: we used to have only a single IP address which is why we used 
another port rather than a different IP for the proxied site. We now 
have multiple IPs and so will be moving this proxying onto a different 
IP on port 443 at some stage.

To make our configuration even more complex we also have to 
screen-scrape the pages being returned by the proxied site. To do this I 
wrote a perl module that uses PerlOutputFilterHandler (although, to be 
honest, I know close to nothing about perl).

Our perl setup (using mod_perl-2.0.0-RC4):

LoadModule perl_module        modules/mod_perl.so
PerlSwitches -I/opt/motorweb/modperl
PerlModule Apache2
PerlModule Motorweb::WOFFilter
<VirtualHost *:444>
   ...
   <Location /abc/>
     ProxyPass http://XXXXX/
     ProxyPassReverse http://XXXXX/
     PerlOutputFilterHandler Motorweb::WOFFilter
   </Location>
</VirtualHost>

This perl module had been running fine for about 6 months before this 
issue turned up, and hasn't been changed since. So I don't think that 
it's the problem.

I have suspected it may be something to do with SSL as once when this 
happened I could access a page fine through http:// but the same page 
through https:// wouldn't return anything (browser would timeout). 
However, on other occasions even http:// doesn't work (although this may 
be because there just isn't the CPU time available).

Our SSLSessionCache files also get really big (ssl_cache.pag is 
currently around 500MB), and while this issue was happening one time I 
checked and the ssl_cache.dir was suspiciously exactly 65536 bytes. But 
it gets recreated on a restart and always comes back to the same size, 
so I guess they're supposed to be that big? Is there a limit on the size 
of the cache or their files? I've also noticed the occasional line like 
this in the error_log (about once a day, sometimes more, sometimes less):

[Wed Apr 20 16:00:18 2005] [error] (20014)Error string not specified 
yet: Cannot store SSL session to DBM file `/var/log/httpd/ssl_cache'

Also remember my note up the top about when I tried a graceful restart. 
I guess the runaway child processes were still hanging around and 
holding onto a global mutex lock to the SSLSessionCache because I got 
the following lines in the error log:

Mon Apr 04 09:22:54 2005] [notice] Graceful restart requested, doing restart
[Mon Apr 04 09:22:57 2005] [warn] (43)Identifier removed: Failed to 
acquire global mutex lock
[Mon Apr 04 09:22:57 2005] [error] (13)Permission denied: Cannot open 
SSLSessionCache DBM file `/var/log/httpd/ssl_cache' for scanning
[Mon Apr 04 09:22:57 2005] [warn] (22)Invalid argument: Failed to 
release global mutex lock

And then the same last three lines repeated over and over and over. 
Although usually "for scanning" was "for reading", but became almost 
entirely "for writing" near the end, and sometimes the acquire error was 
prefixed by "(22)Invalid argument" rather than "(43)Identifier removed". 
This went on for 6 seconds before clearing itself up. The last bit gets 
interesting and I've added those log lines to the end of this post 
because it also relates to some other stuff I want to tell you about first.

Does this give me a hint that it's something to do with the SSL cache? 
Presumably one of the child processes that had gone wild was holding 
onto this mutex to produce this problem after the restart indicating 
that that is the point it had 'locked up'?

See also my notes below about other SSL problems I've had.

Speaking of the error log, here are the other errors (besides "File does 
not exist") that regularly appear in the logs:

[Thu Apr 21 17:08:24 2005] [error] Re-negotiation handshake failed: Not 
accepted by client!?

I get thousands of these (more than 50 an hour at peek times). SSL does 
work fine and so do our client certificates, our users are the kind that 
tell us if it's not working. I've tried to figure out what this means 
but all other posters on the net with this error get additional error 
messages with it. All I get is that one error over and over again.

Really the only other error that fills the error_log is something along 
the lines of this:

[Wed Jan 05 08:39:44 2005] [error] [client 219.89.127.102] 
Apache::Filter::print: (103) Software caused connection abort at 
/opt/motorweb/modperl/Motorweb/WOFFilter.pm line 159, referer: 
https://www.motorweb.co.nz:444/xxx
Apache::Filter: (103) Software caused connection abort at -e line 0[Wed 
Jan 05 08:42:10 2005] [error] Re-negotiation handshake failed: Not 
accepted by client!?

That's not my email formatting either. "Apache::Filter: (103)..." is on 
a new line but "[Wed Jan 05 08:42:10 2005] [error]..." is *not*. I guess 
something isn't adding newlines into the log properly. I've always 
assumed that this is simply the user refreshing/stopping their browser 
half way through a page being filtered. But I'm not sure. This "Software 
caused connection abort" in my perl filter happens a lot when this 100% 
CPU issue is happening. But that I presume is just because users are 
getting sick of not seeing anything and hitting refresh.

One more thing about my setup. I use the java servlet container Resin 
which has an Apache module mod_caucho. This is a small and simple little 
module that passes requests off to a java based server on another 
machine. I don't think this is the problem because if the request has 
been parsed to the point where httpd can decide to use the handler 
caucho-request then I'd except to be seeing the request URL in the 
server-status output.

Another issue relating to SSL and 443/444.

I did previously strike what I thought may be a bug in httpd with 
running the two different SSL servers inside the one httpd instance. In 
some versions of IE, if a user goes to our https 443 site, but *does 
not* access any locations that use SSLVerifyClient, and then goes to the 
444 site (which is SSLVerifyClient require) then this may cause the 
httpd process they connected to on 444 to go 100% CPU and never come 
back. The process would be like that until I killed it. If I turned on 
more verbose logging I'd see this:

[Mon Mar 29 18:42:00 2004] [info] Connection to child 7 established 
(server www.motorweb.co.nz:443, client 172.20.20.106)
[Mon Mar 29 18:42:00 2004] [info] Seeding PRNG with 0 bytes of entropy
[Mon Mar 29 18:42:00 2004] [info] No acceptable peer certificate available
[Mon Mar 29 18:42:00 2004] [info] Connection to child 7 closed with 
abortive shutdown(server www.motorweb.co.nz:443, client 172.20.20.106)

If the user has been to a SSLVerifyClient require location on 443 first 
before going to 444 then the "No acceptable peer certificate" doesn't 
happen and the httpd process doesn't go mad.

This only happened with specific copies of IE in our office. Even the 
exact same version number may cause it on one computer, but not on 
another. Very weird indeed.

The problem I described up the top is different to this. In the way that 
the CPU usage seems to jump from process to process, where as with this 
443/444 issue it just stuck on one process. But it gets me to wondering. 
Would it be worth me separating out the configuration of the server on 
444 (as it shares very little configuration with the others) and run it 
in a separate httpd instance? I presume running a separate httpd 
instance is no problem?

If you've made it this far, thank you.

So, is it an SSL cache problem, another kind of SSL issue, something 
wrong with my perl script, a lack of memory, a DoS attack??? I just 
can't figure it out. As you can see, I've done a lot of investigation on 
this and am stumped. Any ideas, even suggestions on places I should look 
would help.

thanks,
Damon.


As I promised above, here's the tail bit of the log when I had all the 
mutex errors after a the graceful restart:

[Mon Apr 04 09:22:58 2005] [warn] (22)Invalid argument: Failed to 
acquire global mutex lock
[Mon Apr 04 09:22:58 2005] [error] (13)Permission denied: Cannot open 
SSLSessionCache DBM file `/var/log/httpd/ssl_cache' for writing (store)
[Mon Apr 04 09:22:58 2005] [warn] (22)Invalid argument: Failed to 
release global mutex lock
[Mon Apr 04 09:22:58 2005] [warn] (22)Invalid argument: Failed to 
acquire global mutex lock
[Mon Apr 04 09:22:58 2005] [error] (13)Permission denied: Cannot open 
SSLSessionCache DBM file `/var/log/httpd/ssl_cache' for writing (store)
[Mon Apr 04 09:22:58 2005] [warn] (22)Invalid argument: Failed to 
release global mutex lock
[Mon Apr 04 09:22:58 2005] [warn] (22)Invalid argument: Failed to 
acquire global mutex lock
[Mon Apr 04 09:22:58 2005] [error] (13)Permission denied: Cannot open 
SSLSessionCache DBM file `/var/log/httpd/ssl_cache' for writing (store)
[Mon Apr 04 09:22:58 2005] [warn] (22)Invalid argument: Failed to 
release global mutex lock
[Mon Apr 04 09:22:58 2005] [warn] (22)Invalid argument: Failed to 
acquire global mutex lock
[Mon Apr 04 09:22:58 2005] [error] (13)Permission denied: Cannot open 
SSLSessionCache DBM file `/var/log/httpd/ssl_cache' for writing (store)
[Mon Apr 04 09:22:58 2005] [warn] (22)Invalid argument: Failed to 
release global mutex lock
[Mon Apr 04 09:22:59 2005] [error] Cannot re-open SSLSessionCache DBM 
file `/var/log/httpd/ssl_cache' for expiring
[Mon Apr 04 09:22:59 2005] [warn] (22)Invalid argument: Failed to 
release global mutex lock
[Mon Apr 04 09:22:59 2005] [warn] (22)Invalid argument: Failed to 
acquire global mutex lock
[Mon Apr 04 09:22:59 2005] [error] (13)Permission denied: Cannot open 
SSLSessionCache DBM file `/var/log/httpd/ssl_cache' for reading (fetch)
[Mon Apr 04 09:22:59 2005] [warn] (22)Invalid argument: Failed to 
release global mutex lock
Apache::Filter: (103) Software caused connection abort at -e line 0[Mon 
Apr 04 09:23:00 2005] [notice] Apache/2.0.53 configured -- resuming 
normal operations
[Mon Apr 04 09:23:01 2005] [warn] (43)Identifier removed: Failed to 
acquire global mutex lock
[Mon Apr 04 09:23:01 2005] [warn] (43)Identifier removed: Failed to 
release global mutex lock
[Mon Apr 04 09:23:01 2005] [error] [client 222.152.162.185] 
Apache::Filter::print: (103) Software caused connection abort at 
/opt/motorweb/modperl/Motorweb/WOFFilter.pm line 159, referer: 
https://www.motorweb.co.nz:444/wofonline/wfpro/default.asp?category=wfpro&service=record_express
Apache::Filter: (103) Software caused connection abort at -e line 0[Mon 
Apr 04 09:23:03 2005] [warn] (43)Identifier removed: Failed to acquire 
global mutex lock
[Mon Apr 04 09:23:03 2005] [warn] (43)Identifier removed: Failed to 
release global mutex lock
[Mon Apr 04 09:23:03 2005] [warn] (43)Identifier removed: Failed to 
acquire global mutex lock
[Mon Apr 04 09:23:03 2005] [warn] (43)Identifier removed: Failed to 
release global mutex lock

And then things carried on normally.

---------------------------------------------------------------------
The official User-To-User support forum of the Apache HTTP Server Project.
See <URL:http://httpd.apache.org/userslist.html> for more info.
To unsubscribe, e-mail: users-unsubscribe@httpd.apache.org
   "   from the digest: users-digest-unsubscribe@httpd.apache.org
For additional commands, e-mail: users-help@httpd.apache.org