Apache is a general webserver, which is designed to be
- correct first, and fast second. Even so, its performance is
- quite satisfactory. Most sites have less than 10Mbits of
- outgoing bandwidth, which Apache can fill using only a low end
- Pentium-based webserver. In practice sites with more bandwidth
- require more than one machine to fill the bandwidth due to
- other constraints (such as CGI or database transaction
- overhead). For these reasons the development focus has been
- mostly on correctness and configurability.
Apache is a general webserver, which is designed to be correct
+ first, and fast second. Even so, its performance is quite satisfactory.
+ Most sites have less than 10Mbits of outgoing bandwidth, which Apache
+ can fill using only a low end Pentium-based webserver. In practice,
+ sites with more bandwidth require more than one machine to fill the
+ bandwidth due to other constraints (such as CGI or database transaction
+ overhead). For these reasons, the development focus has been mostly on
+ correctness and configurability.
Unfortunately many folks overlook these facts and cite raw
- performance numbers as if they are some indication of the
- quality of a web server product. There is a bare minimum
- performance that is acceptable, beyond that extra speed only
- caters to a much smaller segment of the market. But in order to
- avoid this hurdle to the acceptance of Apache in some markets,
- effort was put into Apache 1.3 to bring performance up to a
- point where the difference with other high-end webservers is
- minimal.
Finally there are the folks who just plain want to see how
- fast something can go. The author falls into this category. The
- rest of this document is dedicated to these folks who want to
- squeeze every last bit of performance out of Apache's current
- model, and want to understand why it does some things which
- slow it down.
Note that this is tailored towards Apache 1.3 on Unix. Some
- of it applies to Apache on NT. Apache on NT has not been tuned
- for performance yet; in fact it probably performs very poorly
- because NT performance requires a different programming
- model.
Apache Performance Notes
@@ -20,20 +20,28 @@
-
- Introduction
+ Introduction
-
Finally there are the folks who just want to see how fast something + can go. The author falls into this category. The rest of this document + is dedicated to these folks who want to squeeze every last bit of + performance out of Apache's current model, and want to understand why + it does some things which slow it down.
+ +Note that this is tailored towards Apache 1.3 on Unix. Some of it + applies to Apache on NT. Apache on NT has not been tuned for + performance yet; in fact it probably performs very poorly because NT + performance requires a different programming model.
The single biggest hardware issue affecting webserver
- performance is RAM. A webserver should never ever have to swap,
- swapping increases the latency of each request beyond a point
- that users consider "fast enough". This causes users to hit
- stop and reload, further increasing the load. You can, and
- should, control the MaxClients
setting so that
- your server does not spawn so many children it starts
- swapping.
Beyond that the rest is mundane: get a fast enough CPU, a - fast enough network card, and fast enough disks, where "fast - enough" is something that needs to be determined by - experimentation.
- -Operating system choice is largely a matter of local - concerns. But a general guideline is to always apply the latest - vendor TCP/IP patches. HTTP serving completely breaks many of - the assumptions built into Unix kernels up through 1994 and - even 1995. Good choices include recent FreeBSD, and Linux.
+The single biggest hardware issue affecting webserver performance is
+ RAM. A webserver should never ever have to swap, as swapping increases
+ the latency of each request beyond a point that users consider "fast
+ enough". This causes users to hit stop and reload, further increasing
+ the load. You can, and should, control the MaxClients
+ setting so that your server does not spawn so many children it starts
+ swapping. The procedure for doing this is simple: determine the size of
+ your average Apache process, by looking at your process list via a tool
+ such as top
, and divide this into your total available
+ memory, leaving some room for other processes.
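The sizing procedure above is simple division, and a minimal sketch of it follows. The function name `estimate_max_clients` and the sample figures (128 MB of RAM, 32 MB reserved, 2 MB per child) are hypothetical illustrations, not values taken from Apache or from this document. Measure your own process sizes with a tool such as top.

```python
def estimate_max_clients(total_ram_mb, reserved_mb, avg_child_mb):
    """Return how many Apache children fit in RAM without swapping.

    total_ram_mb  -- physical memory in the machine
    reserved_mb   -- room left for the OS and other processes (an assumption)
    avg_child_mb  -- average size of one Apache child, as observed in top
    """
    if avg_child_mb <= 0:
        raise ValueError("avg_child_mb must be positive")
    usable_mb = max(total_ram_mb - reserved_mb, 0)
    # Round down: a fractional child would push the machine into swap.
    return int(usable_mb // avg_child_mb)


# Hypothetical machine: 128 MB RAM, 32 MB kept free, 2 MB per child.
print(estimate_max_clients(128, 32, 2))
```

The result is only a starting point for MaxClients. Children that share memory pages or grow over time will shift the real limit, so it is worth confirming under load that the machine does not swap.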
Beyond that the rest is mundane: get a fast enough CPU, a fast + enough network card, and fast enough disks, where "fast enough" is + something that needs to be determined by experimentation.
+ +Operating system choice is largely a matter of local concerns. But a + general guideline is to always apply the latest vendor TCP/IP + patches.
HostnameLookups
and other DNS considerationsPrior to Apache 1.3, HostnameLookups
defaulted
- to On. This adds latency to every request because it requires a
- DNS lookup to complete before the request is finished. In
- Apache 1.3 this setting defaults to Off. However (1.3 or
- later), if you use any Allow from domain
or
- Deny from domain
directives then you will pay for
- a double reverse DNS lookup (a reverse, followed by a forward
- to make sure that the reverse is not being spoofed). So for the
- highest performance avoid using these directives (it's fine to
- use IP addresses rather than domain names).
Note that it's possible to scope the directives, such as
- within a <Location /server-status>
section.
- In this case the DNS lookups are only performed on requests
- matching the criteria. Here's an example which disables lookups
- except for .html and .cgi files:
Prior to Apache 1.3, HostnameLookups
+ defaulted to On
. This adds latency to every request
+ because it requires a DNS lookup to complete before the request is
+ finished. In Apache 1.3 this setting defaults to Off
. If
+ you need to have addresses in your log files resolved to hostnames, use
+ the logresolve program that
+ comes with Apache, or one of the numerous log reporting packages which
+ are available.
It is recommended that you do this sort of postprocessing of your + log files on some machine other than the production web server machine, + in order that this activity not adversely affect server + performance.
+ +If you use any Allow from domain
or
+ Deny from domain
+ directives (i.e., using a hostname, or a domain name, rather than an IP
+ address) then you will pay for a double reverse DNS lookup (a reverse,
+ followed by a forward to make sure that the reverse is not being
+ spoofed). For best performance, therefore, use IP addresses, rather
+ than names, when using these directives, if possible.
Note that it's possible to scope the directives, such as within a
+ <Location /server-status>
section. In this case the
+ DNS lookups are only performed on requests matching the criteria.
+ Here's an example which disables lookups except for .html and .cgi
+ files:
- But even still, if you just need DNS names in some CGIs you - could consider doing the@@ -134,27 +149,18 @@ </Files>
gethostbyname
call in the
- specific CGIs that need it.
-
- Similarly, if you need to have hostname information in your - server logs in order to generate reports of this information, - you can postprocess your log file with logresolve, so that - these lookups can be done without making the client wait. It is - recommended that you do this postprocessing, and any other - statistical analysis of the log file, somewhere other than your - production web server machine, in order that this activity does - not adversely affect server performance.
-But even still, if you just need DNS names in some CGIs you could
+ consider doing the gethostbyname
call in the specific CGIs
+ that need it.
Wherever in your URL-space you do not have an Options
FollowSymLinks
, or you do have an Options
- SymLinksIfOwnerMatch
Apache will have to issue extra
- system calls to check up on symlinks. One extra call per
- filename component. For example, if you had:
- and a request is made for the URI@@ -164,13 +170,13 @@ </Directory>
/index.html
.
- Then Apache will perform lstat(2)
on
- /www
, /www/htdocs
, and
- /www/htdocs/index.html
. The results of these
- lstats
are never cached, so they will occur on
- every single request. If you really desire the symlinks
- security checking you can do something like this:
+
+ and a request is made for the URI /index.html
. Then
+ Apache will perform lstat(2)
on /www
,
+ /www/htdocs
, and /www/htdocs/index.html
. The
+ results of these lstats
are never cached, so they will
+ occur on every single request. If you really desire the symlinks
+ security checking you can do something like this:
- This at least avoids the extra checks for the -@@ -183,20 +189,19 @@ </Directory>
DocumentRoot
path. Note that you'll need to add
- similar sections if you have any Alias
or
- RewriteRule
paths outside of your document root.
- For highest performance, and no symlink protection, set
- FollowSymLinks
everywhere, and never set
- SymLinksIfOwnerMatch
.
- This at least avoids the extra checks for the
+ DocumentRoot
path. Note that you'll need to add similar
+ sections if you have any Alias
or RewriteRule
+ paths outside of your document root. For highest performance, and no
+ symlink protection, set FollowSymLinks
everywhere, and
+ never set SymLinksIfOwnerMatch
.
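To make the per-component cost concrete, the sketch below lists the paths that would each receive an lstat(2) call for a given request. It is an illustration of the description above, not Apache's code. The function name `lstat_targets` is hypothetical, and a DocumentRoot of /www/htdocs is assumed because that is what the paths in the text imply.

```python
import posixpath


def lstat_targets(document_root, uri):
    """List every path prefix that would be checked for one request."""
    full = posixpath.normpath(document_root + "/" + uri.lstrip("/"))
    targets = []
    current = ""
    for part in (p for p in full.split("/") if p):
        current += "/" + part
        targets.append(current)  # one lstat(2) per filename component
    return targets


for path in lstat_targets("/www/htdocs", "/index.html"):
    print(path)
```

Deeper directory trees add one check per extra component on every request, since the results are never cached.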
Wherever in your URL-space you allow overrides (typically
.htaccess
files) Apache will attempt to open
- .htaccess
for each filename component. For
- example,
.htaccess
for each filename component. For example,
- and a request is made for the URI@@ -206,118 +211,183 @@ </Directory>
/index.html
.
- Then Apache will attempt to open /.htaccess
,
- /www/.htaccess
, and
- /www/htdocs/.htaccess
. The solutions are similar
- to the previous case of Options FollowSymLinks
.
- For highest performance use AllowOverride None
- everywhere in your filesystem.
-
- If at all possible, avoid content-negotiation if you're - really interested in every last ounce of performance. In - practice the benefits of negotiation outweigh the performance - penalties. There's one case where you can speed up the server. - Instead of using a wildcard such as:
+ +and a request is made for the URI /index.html
. Then
+ Apache will attempt to open /.htaccess
,
+ /www/.htaccess
, and /www/htdocs/.htaccess
.
+ The solutions are similar to the previous case of Options
+ FollowSymLinks
. For highest performance use AllowOverride
+ None
everywhere in your filesystem.
See also the .htaccess tutorial + for further discussion of this.
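The .htaccess lookups described above follow the same pattern, one attempted open per directory from the filesystem root down to the directory holding the file. The following sketch enumerates them. The function name `htaccess_candidates` is hypothetical, and a DocumentRoot of /www/htdocs is assumed from the paths in the text.

```python
import posixpath


def htaccess_candidates(document_root, uri):
    """List the .htaccess files that would be tried for one request."""
    full = posixpath.normpath(document_root + "/" + uri.lstrip("/"))
    directory = posixpath.dirname(full)
    candidates = ["/.htaccess"]  # the filesystem root is always checked
    current = ""
    for part in (p for p in directory.split("/") if p):
        current += "/" + part
        candidates.append(current + "/.htaccess")
    return candidates


for path in htaccess_candidates("/www/htdocs", "/index.html"):
    print(path)
```

With AllowOverride None in effect for all of these directories, none of the opens is attempted.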
+ +If at all possible, avoid content-negotiation if you're really + interested in every last ounce of performance. In practice the benefits + of negotiation outweigh the performance penalties. There's one case + where you can speed up the server. Instead of using a wildcard such + as:
- Use a complete list of options: + +DirectoryIndex index
Use a complete list of options:
- where you list the most common choice first. -DirectoryIndex index.cgi index.pl index.shtml index.html
where you list the most common choice first.
-Prior to Apache 1.3 the MinSpareServers
,
- MaxSpareServers
, and StartServers
- settings all had drastic effects on benchmark results. In
- particular, Apache required a "ramp-up" period in order to
- reach a number of children sufficient to serve the load being
- applied. After the initial spawning of
- StartServers
children, only one child per second
- would be created to satisfy the MinSpareServers
- setting. So a server being accessed by 100 simultaneous
- clients, using the default StartServers
of 5 would
- take on the order 95 seconds to spawn enough children to handle
- the load. This works fine in practice on real-life servers,
- because they aren't restarted frequently. But does really
- poorly on benchmarks which might only run for ten minutes.
The one-per-second rule was implemented in an effort to - avoid swamping the machine with the startup of new children. If - the machine is busy spawning children it can't service - requests. But it has such a drastic effect on the perceived - performance of Apache that it had to be replaced. As of Apache - 1.3, the code will relax the one-per-second rule. It will spawn - one, wait a second, then spawn two, wait a second, then spawn - four, and it will continue exponentially until it is spawning - 32 children per second. It will stop whenever it satisfies the +
If your site needs content negotiation, consider using
+ type-map
files rather than the Options
+ MultiViews
directive to accomplish the negotiation. See the Content Negotiation
+ documentation for a full discussion of the methods of negotiation, and
+ instructions for creating type-map
files.
Prior to Apache 1.3 the MinSpareServers
,
+ MaxSpareServers
,
+ and StartServers
+ settings all had drastic effects on benchmark results. In particular,
+ Apache required a "ramp-up" period in order to reach a number of
+ children sufficient to serve the load being applied. After the initial
+ spawning of StartServers
children, only one child per
+ second would be created to satisfy the MinSpareServers
+ setting. So a server being accessed by 100 simultaneous clients, using
+ the default StartServers
of 5 would take on the order of 95
+ seconds to spawn enough children to handle the load. This works fine in
+ practice on real-life servers, because they aren't restarted
+ frequently. But results in poor performance on benchmarks, which might
+ only run for ten minutes.
The one-per-second rule was implemented in an effort to avoid
+ swamping the machine with the startup of new children. If the machine
+ is busy spawning children it can't service requests. But it has such a
+ drastic effect on the perceived performance of Apache that it had to be
+ replaced. As of Apache 1.3, the code will relax the one-per-second
+ rule. It will spawn one, wait a second, then spawn two, wait a second,
+ then spawn four, and it will continue exponentially until it is
+ spawning 32 children per second. It will stop whenever it satisfies the
MinSpareServers
setting.
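The difference between the two spawning rules can be checked with a small simulation. This is a simplified model under stated assumptions, not the server's actual maintenance code: it counts one-second cycles only and ignores the MinSpareServers stopping condition. The function name `seconds_to_spawn` is hypothetical.

```python
def seconds_to_spawn(needed, exponential=True, cap=32):
    """Count one-second cycles needed to create `needed` extra children."""
    spawned = 0
    seconds = 0
    rate = 1
    while spawned < needed:
        spawned += rate
        seconds += 1
        if exponential:
            # Double the rate each cycle, up to 32 children per second.
            rate = min(rate * 2, cap)
    return seconds


# 100 clients with 5 children already running leaves 95 to spawn.
print(seconds_to_spawn(95, exponential=False))
print(seconds_to_spawn(95, exponential=True))
```

Under this model the old rule needs 95 cycles, as stated above, while the doubling sequence 1, 2, 4, 8, 16, 32, 32 reaches 95 children in 7.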
This appears to be responsive enough that it's almost
- unnecessary to twiddle the MinSpareServers
,
- MaxSpareServers
and StartServers
- knobs. When more than 4 children are spawned per second, a
- message will be emitted to the ErrorLog
. If you
- see a lot of these errors then consider tuning these settings.
- Use the mod_status
output as a guide.
This appears to be responsive enough that it's almost unnecessary to
+ adjust the MinSpareServers
, MaxSpareServers
+ and StartServers
settings. When more than 4 children are
+ spawned per second, a message will be emitted to the
+ ErrorLog
. If you see a lot of these errors then consider
+ tuning these settings. Use the mod_status
output as a
+ guide.
In particular, you may need to set MinSpareServers
+ higher if traffic on your site is extremely bursty - that is, if the
+ number of connections to your site fluctuates radically in short
+ periods of time. This may be the case, for example, if traffic to your
+ site is highly event-driven, such as sites for major sports events, or
+ other sites where users are encouraged to visit the site at a
+ particular time.
Related to process creation is process death induced by the
- MaxRequestsPerChild
setting. By default this is 0,
- which means that there is no limit to the number of requests
- handled per child. If your configuration currently has this set
- to some very low number, such as 30, you may want to bump this
- up significantly. If you are running SunOS or an old version of
- Solaris, limit this to 10000 or so because of memory leaks.
When keep-alives are in use, children will be kept busy
- doing nothing waiting for more requests on the already open
- connection. The default KeepAliveTimeout
of 15
- seconds attempts to minimize this effect. The tradeoff here is
- between network bandwidth and server resources. In no event
- should you raise this above about 60 seconds, as
+ Related to process creation is process death induced by the MaxRequestsPerChild setting. By default this is 0, which
+ means that there is no limit to the number of requests handled per
+ child. If your configuration currently has this set to some very low
+ number, such as 30, you may want to bump this up significantly. If you
+ are running SunOS or an old version of Solaris, limit this to 10000 or
+ so because of memory leaks.
When keep-alives are in use, children will be kept busy doing
+ nothing waiting for more requests on the already open connection. The
+ default KeepAliveTimeout
of 15 seconds attempts to
+ minimize this effect. The tradeoff here is between network bandwidth
+ and server resources. In no event should you raise this above about 60
+ seconds, as
most of the benefits are lost.
Since memory usage is such an important consideration in + performance, you should attempt to eliminate modules that you are not + actually using. If you have built the modules as DSOs, eliminating modules is a simple matter of + commenting out the associated AddModule and LoadModule directives for + that module. This allows you to experiment with removing modules, and + seeing if your site still functions in their absence.
+ +If, on the other hand, you have modules statically linked into your + Apache binary, you will need to recompile Apache in order to remove + unwanted modules.
+ +An associated question that arises here is, of course, what modules
+ you need, and which ones you don't. The answer here will, of course,
+ vary from one web site to another. However, the minimal list of
+ modules which you can get by with tends to include mod_mime, mod_dir, and mod_log_config.
+ mod_log_config
is, of course, optional, as you can run a
+ web site without log files. This is, however, not recommended.
Apache comes with a module, mod_mmap_static, which is not + enabled by default, which allows you to map files into RAM, and + serve them directly from memory rather than from the disc, which + should result in substantial performance improvement for + frequently requested files. Note that when files are modified, you + will need to restart your server in order to serve the latest + version of the file, so this is not appropriate for files which + change frequently. See the documentation for this module for more + complete details.
+If you include mod_status
and you also set
- ExtendedStatus On
when building and running
- Apache, then on every request Apache will perform two calls to
- gettimeofday(2)
(or times(2)
- depending on your operating system), and (pre-1.3) several
- extra calls to time(2)
. This is all done so that
- the status report contains timing indications. For highest
- performance, set ExtendedStatus off
(which is the
- default).
If you include mod_status
and you also
+ set ExtendedStatus On
when building and running Apache,
+ then on every request Apache will perform two calls to
+ gettimeofday(2)
(or times(2)
depending on
+ your operating system), and (pre-1.3) several extra calls to
+ time(2)
. This is all done so that the status report
+ contains timing indications. For highest performance, set
+ ExtendedStatus off
(which is the default).
mod_status
should probably be configured to allow
+ access by only a few users, rather than to the general public, so this
+ will likely have very low impact on your overall performance.
This discusses a shortcoming in the Unix socket API. Suppose
- your web server uses multiple Listen
statements to
- listen on either multiple ports or multiple addresses. In order
- to test each socket to see if a connection is ready Apache uses
- select(2)
. select(2)
indicates that a
- socket has zero or at least one connection
- waiting on it. Apache's model includes multiple children, and
- all the idle ones test for new connections at the same time. A
- naive implementation looks something like this (these examples
- do not match the code, they're contrived for pedagogical
- purposes):
This discusses a shortcoming in the Unix socket API. Suppose your
+ web server uses multiple Listen
statements to listen on
+ either multiple ports or multiple addresses. In order to test each
+ socket to see if a connection is ready Apache uses
+ select(2)
. select(2)
indicates that a socket
+ has zero or at least one connection waiting on it.
+ Apache's model includes multiple children, and all the idle ones test
+ for new connections at the same time. A naive implementation looks
+ something like this (these examples do not match the code, they're
+ contrived for pedagogical purposes):
- But this naive implementation has a serious starvation problem. - Recall that multiple children execute this loop at the same - time, and so multiple children will block at -@@ -344,42 +414,37 @@ }
select
when they are in between requests. All
- those blocked children will awaken and return from
- select
when a single request appears on any socket
- (the number of children which awaken varies depending on the
- operating system and timing issues). They will all then fall
- down into the loop and try to accept
the
- connection. But only one will succeed (assuming there's still
- only one connection ready), the rest will be blocked
- in accept
. This effectively locks those children
- into serving requests from that one socket and no other
- sockets, and they'll be stuck there until enough new requests
- appear on that socket to wake them all up. This starvation
- problem was first documented in PR#467. There
- are at least two solutions.
-
- One solution is to make the sockets non-blocking. In this
- case the accept
won't block the children, and they
- will be allowed to continue immediately. But this wastes CPU
- time. Suppose you have ten idle children in
- select
, and one connection arrives. Then nine of
- those children will wake up, try to accept
the
- connection, fail, and loop back into select
,
- accomplishing nothing. Meanwhile none of those children are
- servicing requests that occurred on other sockets until they
- get back up to the select
again. Overall this
- solution does not seem very fruitful unless you have as many
- idle CPUs (in a multiprocessor box) as you have idle children,
- not a very likely situation.
Another solution, the one used by Apache, is to serialize - entry into the inner loop. The loop looks like this - (differences highlighted):
+ But this naive implementation has a serious starvation problem. Recall + that multiple children execute this loop at the same time, and so + multiple children will block at select
when they are in
+ between requests. All those blocked children will awaken and return
+ from select
when a single request appears on any socket
+ (the number of children which awaken varies depending on the operating
+ system and timing issues). They will all then fall down into the loop
+ and try to accept
the connection. But only one will
+ succeed (assuming there's still only one connection ready), the rest
+ will be blocked in accept
. This effectively locks
+ those children into serving requests from that one socket and no other
+ sockets, and they'll be stuck there until enough new requests appear on
+ that socket to wake them all up. This starvation problem was first
+ documented in PR#467. There are at
+ least two solutions.
+
+ One solution is to make the sockets non-blocking. In this case the
+ accept
won't block the children, and they will be allowed
+ to continue immediately. But this wastes CPU time. Suppose you have ten
+ idle children in select
, and one connection arrives. Then
+ nine of those children will wake up, try to accept
the
+ connection, fail, and loop back into select
, accomplishing
+ nothing. Meanwhile none of those children are servicing requests that
+ occurred on other sockets until they get back up to the
+ select
again. Overall this solution does not seem very
+ fruitful unless you have as many idle CPUs (in a multiprocessor box) as
+ you have idle children, not a very likely situation.
Another solution, the one used by Apache, is to serialize entry into + the inner loop. The loop looks like this (differences highlighted):
The functions@@ -410,158 +475,141 @@
accept_mutex_on
and accept_mutex_off
- implement a mutual exclusion semaphore. Only one child can have
- the mutex at any time. There are several choices for
- implementing these mutexes. The choice is defined in
- src/conf.h
(pre-1.3) or
- src/include/ap_config.h
(1.3 or later). Some
- architectures do not have any locking choice made, on these
- architectures it is unsafe to use multiple Listen
- directives.
+ implement a mutual exclusion semaphore. Only one child can have the
+ mutex at any time. There are several choices for implementing these
+ mutexes. The choice is defined in src/conf.h
(pre-1.3) or
+ src/include/ap_config.h
(1.3 or later). Some architectures
+ do not have any locking choice made, on these architectures it is
+ unsafe to use multiple Listen
directives.
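As a rough illustration of the serialized loop, the sketch below takes an exclusive lock around the select and accept steps, so that only one process at a time waits on the listening sockets. It uses flock(2), one of the locking methods this document describes, through Python's standard `fcntl` module. The function name `accept_one` and its arguments are hypothetical, and, like the document's own examples, this does not match Apache's code.

```python
import fcntl
import select
import socket


def accept_one(listeners, lock_file):
    """Accept one connection from any listener while holding the mutex."""
    fcntl.flock(lock_file, fcntl.LOCK_EX)      # plays the role of accept_mutex_on
    try:
        # Only the lock holder is in select/accept, so a socket reported
        # ready here cannot be taken by a sibling before we accept it.
        ready, _, _ = select.select(listeners, [], [])
        return ready[0].accept()
    finally:
        fcntl.flock(lock_file, fcntl.LOCK_UN)  # plays the role of accept_mutex_off
```

Each child would call this in a loop with the same open lock file and then process the returned connection after the lock is released, so request handling itself is not serialized.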
HAVE_FLOCK_SERIALIZED_ACCEPT
flock(2)
system call to
- lock a lock file (located by the LockFile
- directive).flock(2)
system call to lock a
+ lock file (located by the LockFile
directive).HAVE_FCNTL_SERIALIZED_ACCEPT
fcntl(2)
system call to
- lock a lock file (located by the LockFile
- directive).fcntl(2)
system call to lock a
+ lock file (located by the LockFile
directive).HAVE_SYSVSEM_SERIALIZED_ACCEPT
ipcs(8)
man page). The other is that the
- semaphore API allows for a denial of service attack by any
- CGIs running under the same uid as the webserver
- (i.e., all CGIs, unless you use something like
- suexec or cgiwrapper). For these reasons this method is not
- used on any architecture except IRIX (where the previous two
- are prohibitively expensive on most IRIX boxes).ipcs(8)
man page).
+ The other is that the semaphore API allows for a denial of service
+ attack by any CGIs running under the same uid as the webserver
+ (i.e., all CGIs, unless you use something like suexec or
+ cgiwrapper). For these reasons this method is not used on any
+ architecture except IRIX (where the previous two are prohibitively
+ expensive on most IRIX boxes).
HAVE_USLOCK_SERIALIZED_ACCEPT
usconfig(2)
to create a mutex. While this
- method avoids the hassles of SysV-style semaphores, it is not
- the default for IRIX. This is because on single processor
- IRIX boxes (5.3 or 6.2) the uslock code is two orders of
- magnitude slower than the SysV-semaphore code. On
- multi-processor IRIX boxes the uslock code is an order of
- magnitude faster than the SysV-semaphore code. Kind of a
- messed up situation. So if you're using a multiprocessor IRIX
- box then you should rebuild your webserver with
+ usconfig(2)
to create a mutex. While this method avoids
+ the hassles of SysV-style semaphores, it is not the default for IRIX.
+ This is because on single processor IRIX boxes (5.3 or 6.2) the
+ uslock code is two orders of magnitude slower than the SysV-semaphore
+ code. On multi-processor IRIX boxes the uslock code is an order of
+ magnitude faster than the SysV-semaphore code. Kind of a messed up
+ situation. So if you're using a multiprocessor IRIX box then you
+ should rebuild your webserver with
-DHAVE_USLOCK_SERIALIZED_ACCEPT
on the
EXTRA_CFLAGS
.HAVE_PTHREAD_SERIALIZED_ACCEPT
If your system has another method of serialization which
- isn't in the above list then it may be worthwhile adding code
- for it (and submitting a patch back to Apache). The above
- HAVE_METHOD_SERIALIZED_ACCEPT
defines specify
- which method is available and works on the platform (you can
- have more than one); USE_METHOD_SERIALIZED_ACCEPT
- is used to specify the default method (see the
- AcceptMutex
directive).
Another solution that has been considered but never - implemented is to partially serialize the loop -- that is, let - in a certain number of processes. This would only be of - interest on multiprocessor boxes where it's possible multiple - children could run simultaneously, and the serialization - actually doesn't take advantage of the full bandwidth. This is - a possible area of future investigation, but priority remains +
If your system has another method of serialization which isn't in
+ the above list then it may be worthwhile adding code for it (and
+ submitting a patch back to Apache). The above
+ HAVE_METHOD_SERIALIZED_ACCEPT
defines specify which method
+ is available and works on the platform (you can have more than one);
+ USE_METHOD_SERIALIZED_ACCEPT
is used to specify the
+ default method (see the AcceptMutex
directive).
Another solution that has been considered but never implemented is + to partially serialize the loop -- that is, let in a certain number of + processes. This would only be of interest on multiprocessor boxes where + it's possible multiple children could run simultaneously, and the + serialization actually doesn't take advantage of the full bandwidth. + This is a possible area of future investigation, but priority remains low because highly parallel web servers are not the norm.
-Ideally you should run servers without multiple
- Listen
statements if you want the highest
- performance. But read on.
Ideally you should run servers without multiple Listen
+ statements if you want the highest performance. But read on.
The above is fine and dandy for multiple socket servers, but
- what about single socket servers? In theory they shouldn't
- experience any of these same problems because all children can
- just block in accept(2)
until a connection
- arrives, and no starvation results. In practice this hides
- almost the same "spinning" behavior discussed above in the
- non-blocking solution. The way that most TCP stacks are
- implemented, the kernel actually wakes up all processes blocked
- in accept
when a single connection arrives. One of
- those processes gets the connection and returns to user-space,
- the rest spin in the kernel and go back to sleep when they
- discover there's no connection for them. This spinning is
- hidden from the user-land code, but it's there nonetheless.
- This can result in the same load-spiking wasteful behavior
- that a non-blocking solution to the multiple sockets case
- can.
For this reason we have found that many architectures behave - more "nicely" if we serialize even the single socket case. So - this is actually the default in almost all cases. Crude - experiments under Linux (2.0.30 on a dual Pentium pro 166 - w/128Mb RAM) have shown that the serialization of the single - socket case causes less than a 3% decrease in requests per - second over unserialized single-socket. But unserialized - single-socket showed an extra 100ms latency on each request. - This latency is probably a wash on long haul lines, and only an - issue on LANs. If you want to override the single socket +
The above is fine and dandy for multiple socket servers, but what
+ about single socket servers? In theory they shouldn't experience any of
+ these same problems because all children can just block in
+ accept(2)
until a connection arrives, and no starvation
+ results. In practice this hides almost the same "spinning" behavior
+ discussed above in the non-blocking solution. The way that most TCP
+ stacks are implemented, the kernel actually wakes up all processes
+ blocked in accept
when a single connection arrives. One of
+ those processes gets the connection and returns to user-space, the rest
+ spin in the kernel and go back to sleep when they discover there's no
+ connection for them. This spinning is hidden from the user-land code,
+ but it's there nonetheless. This can result in the same load-spiking
+ wasteful behavior that a non-blocking solution to the multiple sockets
+ case can.
For this reason we have found that many architectures behave more
+ "nicely" if we serialize even the single socket case. So this is
+ actually the default in almost all cases. Crude experiments under Linux
+ (2.0.30 on a dual Pentium pro 166 w/128Mb RAM) have shown that the
+ serialization of the single socket case causes less than a 3% decrease
+ in requests per second over unserialized single-socket. But
+ unserialized single-socket showed an extra 100ms latency on each
+ request. This latency is probably a wash on long haul lines, and only
+ an issue on LANs. If you want to override the single socket
serialization you can define
- SINGLE_LISTEN_UNSERIALIZED_ACCEPT
and then
- single-socket servers will not serialize at all.
SINGLE_LISTEN_UNSERIALIZED_ACCEPT
and then single-socket
+ servers will not serialize at all.
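What "serializing" accept means can be sketched as follows. This is a minimal illustration only, assuming an fcntl() lock on a file descriptor that all children share; the names accept_mutex_on, accept_mutex_off and serialized_accept are hypothetical and are not taken from the Apache source.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Take (F_WRLCK) or release (F_UNLCK) a whole-file fcntl() lock.
 * Hypothetical helper, not Apache's own implementation. */
static int accept_mutex_set(int lock_fd, short type)
{
    struct flock fl;

    memset(&fl, 0, sizeof fl);
    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;               /* 0 means "through end of file" */
    /* F_SETLKW sleeps until the lock can be granted. */
    return fcntl(lock_fd, F_SETLKW, &fl);
}

int accept_mutex_on(int lock_fd)  { return accept_mutex_set(lock_fd, F_WRLCK); }
int accept_mutex_off(int lock_fd) { return accept_mutex_set(lock_fd, F_UNLCK); }

/* Only the child holding the lock sleeps in accept(); the other children
 * sleep on the lock instead, so a new connection does not wake them all. */
int serialized_accept(int lock_fd, int listen_fd)
{
    int conn;

    if (accept_mutex_on(lock_fd) < 0)
        return -1;
    conn = accept(listen_fd, NULL, NULL);
    accept_mutex_off(lock_fd);
    return conn;
}
```

The lock is dropped as soon as accept() returns, so the next child can start waiting while this one serves the connection.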
As discussed in - draft-ietf-http-connection-00.txt section 8, in order for - an HTTP server to reliably implement the - protocol it needs to shutdown each direction of the - communication independently (recall that a TCP connection is - bi-directional, each half is independent of the other). This - fact is often overlooked by other servers, but is correctly - implemented in Apache as of 1.2.
- -When this feature was added to Apache it caused a flurry of - problems on various versions of Unix because of a - shortsightedness. The TCP specification does not state that the - FIN_WAIT_2 state has a timeout, but it doesn't prohibit it. On - systems without the timeout, Apache 1.2 induces many sockets - stuck forever in the FIN_WAIT_2 state. In many cases this can - be avoided by simply upgrading to the latest TCP/IP patches - supplied by the vendor. In cases where the vendor has never - released patches (i.e., SunOS4 -- although folks with - a source license can patch it themselves) we have decided to - disable this feature.
- -There are two ways of accomplishing this. One is the socket
- option SO_LINGER
. But as fate would have it, this
- has never been implemented properly in most TCP/IP stacks. Even
- on those stacks with a proper implementation (i.e.,
- Linux 2.0.31) this method proves to be more expensive (cputime)
- than the next solution.
For the most part, Apache implements this in a function
- called lingering_close
(in
- http_main.c
). The function looks roughly like
- this:
When this feature was added to Apache it caused a flurry of problems + on various versions of Unix because of a shortsightedness. The TCP + specification does not state that the FIN_WAIT_2 state has a timeout, + but it doesn't prohibit it. On systems without the timeout, Apache 1.2 + induces many sockets stuck forever in the FIN_WAIT_2 state. In many + cases this can be avoided by simply upgrading to the latest TCP/IP + patches supplied by the vendor. In cases where the vendor has never + released patches (i.e., SunOS4 -- although folks with a source + license can patch it themselves) we have decided to disable this + feature.
+ +There are two ways of accomplishing this. One is the socket option
+ SO_LINGER
. But as fate would have it, this has never been
+ implemented properly in most TCP/IP stacks. Even on those stacks with a
+ proper implementation (e.g., Linux 2.0.31) this method proves
+ to be more expensive (cputime) than the next solution.
For the most part, Apache implements this in a function called
+ lingering_close
(in http_main.c
). The
+ function looks roughly like this:
- This naturally adds some expense at the end of a connection, - but it is required for a reliable implementation. As HTTP/1.1 - becomes more prevalent, and all connections are persistent, - this expense will be amortized over more requests. If you want - to play with fire and disable this feature you can define -@@ -590,51 +638,47 @@ }
NO_LINGCLOSE
, but this is not recommended at all.
- In particular, as HTTP/1.1 pipelined persistent connections
- come into use lingering_close
is an absolute
+ This naturally adds some expense at the end of a connection, but it is
+ required for a reliable implementation. As HTTP/1.1 becomes more
+ prevalent, and all connections are persistent, this expense will be
+ amortized over more requests. If you want to play with fire and disable
+ this feature you can define NO_LINGCLOSE
, but this is not
+ recommended at all. In particular, as HTTP/1.1 pipelined persistent
+ connections come into use lingering_close
is an absolute
necessity (and
- pipelined connections are faster, so you want to support
- them).
+ href="http://www.w3.org/Protocols/HTTP/Performance/Pipeline.html">pipelined
+ connections are faster, so you want to support them).
Apache's parent and children communicate with each other
- through something called the scoreboard. Ideally this should be
- implemented in shared memory. For those operating systems that
- we either have access to, or have been given detailed ports
- for, it typically is implemented using shared memory. The rest
- default to using an on-disk file. The on-disk file is not only
- slow, but it is unreliable (and less featured). Peruse the
- src/main/conf.h
file for your architecture and
- look for either USE_MMAP_SCOREBOARD
or
- USE_SHMGET_SCOREBOARD
. Defining one of those two
- (as well as their companions HAVE_MMAP
and
- HAVE_SHMGET
respectively) enables the supplied
- shared memory code. If your system has another type of shared
- memory, edit the file src/main/http_main.c
and add
- the hooks necessary to use it in Apache. (Send us back a patch
- too please.)
Historical note: The Linux port of Apache didn't start to - use shared memory until version 1.2 of Apache. This oversight - resulted in really poor and unreliable behavior of earlier - versions of Apache on Linux.
+Apache's parent and children communicate with each other through
+ something called the scoreboard. Ideally this should be implemented in
+ shared memory. For those operating systems that we either have access
+ to, or have been given detailed ports for, it typically is implemented
+ using shared memory. The rest default to using an on-disk file. The
+ on-disk file is not only slow, but it is unreliable (and less
+ featured). Peruse the src/main/conf.h
file for your
+ architecture and look for either USE_MMAP_SCOREBOARD
or
+ USE_SHMGET_SCOREBOARD
. Defining one of those two (as well
+ as their companions HAVE_MMAP
and HAVE_SHMGET
+ respectively) enables the supplied shared memory code. If your system
+ has another type of shared memory, edit the file
+ src/main/http_main.c
and add the hooks necessary to use it
+ in Apache. (Send us back a patch too please.)
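To illustrate why shared memory suits the scoreboard, here is a minimal sketch in which a child updates a record and the parent reads it straight from memory. The struct slot layout, the SLOTS count and the use of /dev/zero are assumptions for the example, not the layout or mechanism Apache uses on any particular platform.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define SLOTS 4   /* hypothetical number of child slots */

/* Hypothetical per-child record; a real scoreboard tracks far more. */
struct slot { volatile int requests_served; };

/* Map SLOTS records that stay shared across fork(). Mapping /dev/zero
 * with MAP_SHARED is one traditional way to get zero-filled shared memory. */
struct slot *scoreboard_create(void)
{
    void *p;
    int fd = open("/dev/zero", O_RDWR);

    if (fd < 0)
        return NULL;
    p = mmap(NULL, SLOTS * sizeof(struct slot), PROT_READ | PROT_WRITE,
             MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping outlives the descriptor */
    return p == MAP_FAILED ? NULL : (struct slot *)p;
}

/* Fork one child that records work in slot 0, then read the value back
 * in the parent. Returns the value the parent sees, or -1 on failure. */
int scoreboard_demo(void)
{
    struct slot *sb = scoreboard_create();
    pid_t pid;
    int seen;

    if (sb == NULL)
        return -1;
    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        sb[0].requests_served = 7;   /* child updates its own slot */
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    seen = sb[0].requests_served;
    munmap(sb, SLOTS * sizeof(struct slot));
    return seen;
}
```

No file I/O happens when the child updates its slot or when the parent reads it, which is the advantage over an on-disk scoreboard.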
Historical note: The Linux port of Apache didn't start to use shared + memory until version 1.2 of Apache. This oversight resulted in really + poor and unreliable behavior of earlier versions of Apache on + Linux.
DYNAMIC_MODULE_LIMIT
If you have no intention of using dynamically loaded modules
- (you probably don't if you're reading this and tuning your
- server for every last ounce of performance) then you should add
- -DDYNAMIC_MODULE_LIMIT=0
when building your
- server. This will save RAM that's allocated only for supporting
- dynamically loaded modules.
If you have no intention of using dynamically loaded modules (you
+ probably don't if you're reading this and tuning your server for every
+ last ounce of performance) then you should add
+ -DDYNAMIC_MODULE_LIMIT=0
when building your server. This
+ will save RAM that's allocated only for supporting dynamically loaded
+ modules.
strace
program, other
- similar programs include truss
,
- ktrace
, and par
.)
+ The file being requested is a static 6K file of no particular content.
+ Traces of non-static requests or requests with content negotiation look
+ wildly different (and quite ugly in some cases). First the entire
+ trace, then we'll examine details. (This was generated by the
+ strace
program, other similar programs include
+ truss
, ktrace
, and par
.)
These two calls can be removed by defining -@@ -698,8 +741,7 @@
SINGLE_LISTEN_UNSERIALIZED_ACCEPT
as described
- earlier.
+ SINGLE_LISTEN_UNSERIALIZED_ACCEPT
as described earlier.
Notice the SIGUSR1
manipulation:
SIGUSR1
it sends a
- SIGUSR1
to all of its children (and it also
- increments a "generation counter" in shared memory). Any
- children that are idle (between connections) will immediately
- die off when they receive the signal. Any children that are in
- keep-alive connections, but are in between requests will die
- off immediately. But any children that have a connection and
- are still waiting for the first request will not die off
- immediately.
-
- To see why this is necessary, consider how a browser reacts
- to a closed connection. If the connection was a keep-alive
- connection and the request being serviced was not the first
- request then the browser will quietly reissue the request on a
- new connection. It has to do this because the server is always
- free to close a keep-alive connection in between requests
- (i.e., due to a timeout or because of a maximum number
- of requests). But, if the connection is closed before the first
- response has been received the typical browser will display a
- "document contains no data" dialogue (or a broken image icon).
- This is done on the assumption that the server is broken in
- some way (or maybe too overloaded to respond at all). So Apache
- tries to avoid ever deliberately closing the connection before
- it has sent a single response. This is the cause of those
- SIGUSR1
manipulations.
Note that it is theoretically possible to eliminate all - three of these calls. But in rough tests the gain proved to be - almost unnoticeable.
+ This is caused by the implementation of graceful restarts. When the + parent receives a SIGUSR1
it sends a SIGUSR1
+ to all of its children (and it also increments a "generation counter"
+ in shared memory). Any children that are idle (between connections)
+ will immediately die off when they receive the signal. Any children
+ that are in keep-alive connections, but are in between requests will
+ die off immediately. But any children that have a connection and are
+ still waiting for the first request will not die off immediately.
+
+ To see why this is necessary, consider how a browser reacts to a
+ closed connection. If the connection was a keep-alive connection and
+ the request being serviced was not the first request then the browser
+ will quietly reissue the request on a new connection. It has to do this
+ because the server is always free to close a keep-alive connection in
+ between requests (e.g., due to a timeout or because of a
+ maximum number of requests). But, if the connection is closed before
+ the first response has been received the typical browser will display a
+ "document contains no data" dialogue (or a broken image icon). This is
+ done on the assumption that the server is broken in some way (or maybe
+ too overloaded to respond at all). So Apache tries to avoid ever
+ deliberately closing the connection before it has sent a single
+ response. This is the cause of those SIGUSR1
+ manipulations.
Note that it is theoretically possible to eliminate all three of + these calls. But in rough tests the gain proved to be almost + unnoticeable.
-In order to implement virtual hosts, Apache needs to know - the local socket address used to accept the connection:
+In order to implement virtual hosts, Apache needs to know the local + socket address used to accept the connection:
getsockname(3, {sin_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("127.0.0.1")}, [16]) = 0
- It is possible to eliminate this call in many situations (such - as when there are no virtual hosts, or when
Listen
- directives are used which do not have wildcard addresses). But
- no effort has yet been made to do these optimizations.
+ It is possible to eliminate this call in many situations (such as when
+ there are no virtual hosts, or when Listen
directives are
+ used which do not have wildcard addresses). But no effort has yet been
+ made to do these optimizations.
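A minimal sketch of the lookup itself, assuming an IPv4 socket: the helper name local_address is hypothetical. With a wildcard Listen address, the answer is only known once the connection has been accepted, which is why the call cannot simply be skipped in that case.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Fill ip (at least INET_ADDRSTRLEN bytes) and port with the local IPv4
 * address of fd. Returns 0 on success, -1 on failure. */
int local_address(int fd, char *ip, size_t ip_len, unsigned short *port)
{
    struct sockaddr_in sa;
    socklen_t len = sizeof sa;

    memset(&sa, 0, sizeof sa);
    if (getsockname(fd, (struct sockaddr *)&sa, &len) < 0)
        return -1;
    if (sa.sin_family != AF_INET)
        return -1;              /* this sketch handles IPv4 only */
    if (inet_ntop(AF_INET, &sa.sin_addr, ip, (socklen_t)ip_len) == NULL)
        return -1;
    *port = ntohs(sa.sin_port);
    return 0;
}
```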
Apache turns off the Nagle algorithm:
@@ -764,8 +803,8 @@ because of problems described in a - paper by John Heidemann. + href="http://www.isi.edu/~johnh/PAPERS/Heidemann97a.html">a paper by + John Heidemann.
Notice the two time
calls:
As described earlier, ExtendedStatus On
causes
- two gettimeofday
calls and a call to
- times
:
As described earlier, ExtendedStatus On
causes two
+ gettimeofday
calls and a call to times
:
@@ -797,8 +835,8 @@ times({tms_utime=5, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 446747
- These can be removed by setting
ExtendedStatus Off
- (which is the default).
+ These can be removed by setting ExtendedStatus Off
(which
+ is the default).
It might seem odd to call stat
:
PATH_INFO
for use by CGIs. In fact if the request
- had been for the URI /cgi-bin/printenv/foobar
then
- there would be two calls to stat
. The first for
- /home/dgaudet/ap/apachen/cgi-bin/printenv/foobar
- which does not exist, and the second for
- /home/dgaudet/ap/apachen/cgi-bin/printenv
, which
- does exist. Regardless, at least one stat
call is
- necessary when serving static files because the file size and
- modification times are used to generate HTTP headers (such as
- Content-Length
, Last-Modified
) and
- implement protocol features (such as
- If-Modified-Since
). A somewhat more clever server
- could avoid the stat
when serving non-static
- files, however doing so in Apache is very difficult given the
- modular structure.
+ PATH_INFO
for use by CGIs. In fact if the request had been
+ for the URI /cgi-bin/printenv/foobar
then there would be
+ two calls to stat
. The first for
+ /home/dgaudet/ap/apachen/cgi-bin/printenv/foobar
which
+ does not exist, and the second for
+ /home/dgaudet/ap/apachen/cgi-bin/printenv
, which does
+ exist. Regardless, at least one stat
call is necessary
+ when serving static files because the file size and modification times
+ are used to generate HTTP headers (such as Content-Length
,
+ Last-Modified
) and implement protocol features (such as
+ If-Modified-Since
). A somewhat more clever server could
+ avoid the stat
when serving non-static files, however
+ doing so in Apache is very difficult given the modular structure.
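To make the dependency on stat concrete, here is a minimal sketch that derives the two headers from a single stat call. The function name static_file_headers is hypothetical, and the date formatting assumes the default "C" locale so that day and month names come out in English.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>

/* Write Content-Length and Last-Modified header lines for path into out.
 * Returns the number of characters written, or -1 on failure. */
int static_file_headers(const char *path, char *out, size_t out_len)
{
    struct stat st;
    struct tm tm_utc;
    char date[64];
    int n;

    if (stat(path, &st) < 0)
        return -1;
    if (gmtime_r(&st.st_mtime, &tm_utc) == NULL)
        return -1;
    /* HTTP dates are always expressed in GMT. */
    if (strftime(date, sizeof date, "%a, %d %b %Y %H:%M:%S GMT", &tm_utc) == 0)
        return -1;
    n = snprintf(out, out_len,
                 "Content-Length: %lld\r\nLast-Modified: %s\r\n",
                 (long long)st.st_size, date);
    return (n < 0 || (size_t)n >= out_len) ? -1 : n;
}
```

The same st_mtime value is what a server would compare against an If-Modified-Since date.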
All static files are served using mmap
:
mmap
small
- files than it is to simply read
them. The define
- MMAP_THRESHOLD
can be set to the minimum size
- required before using mmap
. By default it's set to
- 0 (except on SunOS4 where experimentation has shown 8192 to be
- a better value). Using a tool such as lmbench you can
- determine the optimal setting for your environment.
-
- You may also wish to experiment with
- MMAP_SEGMENT_SIZE
(default 32768) which determines
- the maximum number of bytes that will be written at a time from
- mmap()d files. Apache only resets the client's
- Timeout
in between write()s. So setting this large
- may lock out low bandwidth clients unless you also increase the
+ On some architectures it's slower to mmap
small files than
+ it is to simply read
them. The define
+ MMAP_THRESHOLD
can be set to the minimum size required
+ before using mmap
. By default it's set to 0 (except on
+ SunOS4 where experimentation has shown 8192 to be a better value).
+ Using a tool such as lmbench you can determine
+ the optimal setting for your environment.
+
+
You may also wish to experiment with MMAP_SEGMENT_SIZE
+ (default 32768) which determines the maximum number of bytes that will
+ be written at a time from mmap()d files. Apache only resets the
+ client's Timeout
in between write()s. So setting this
+ large may lock out low bandwidth clients unless you also increase the
Timeout
.
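The effect of a segment size can be sketched as follows: the file is mapped once and then written out in pieces of at most segment bytes. The function name send_mmapped and its parameters are assumptions for the example; it does not implement a timeout, it only marks where one would be reset.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Send the file at path to out_fd from an mmap()ed region, writing at
 * most segment bytes per write() call. Returns bytes sent, -1 on error. */
long send_mmapped(int out_fd, const char *path, size_t segment)
{
    struct stat st;
    char *base;
    size_t size, off = 0;
    int fd;

    if (segment == 0)
        return -1;
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    size = (size_t)st.st_size;
    if (size == 0) {            /* mmap() rejects zero-length mappings */
        close(fd);
        return 0;
    }
    base = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                  /* the mapping stays valid after close */
    if (base == MAP_FAILED)
        return -1;

    while (off < size) {
        size_t want = size - off < segment ? size - off : segment;
        ssize_t n = write(out_fd, base + off, want);

        if (n <= 0) {
            munmap(base, size);
            return -1;
        }
        /* A server would reset the client's timeout here, between writes. */
        off += (size_t)n;
    }
    munmap(base, size);
    return (long)off;
}
```

A larger segment means fewer write() calls but a longer stretch during which a slow client gets no timeout reset, which is the trade-off described above.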
It may even be the case that mmap
isn't used on
- your architecture; if so then defining
- USE_MMAP_FILES
and HAVE_MMAP
might
- work (if it works then report back to us).
Apache does its best to avoid copying bytes around in
- memory. The first write of any request typically is turned into
- a writev
which combines both the headers and the
- first hunk of data:
It may even be the case that mmap
isn't used on your
+ architecture; if so then defining USE_MMAP_FILES
and
+ HAVE_MMAP
might work (if it works then report back to
+ us).
Apache does its best to avoid copying bytes around in memory. The
+ first write of any request typically is turned into a
+ writev
which combines both the headers and the first hunk
+ of data:
writev(3, [{"HTTP/1.1 200 OK\r\nDate: Thu, 11"..., 245}, {"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 6144}], 2) = 6389
- When doing HTTP/1.1 chunked encoding Apache will generate up to - four element
writev
s. The goal is to push the byte
- copying into the kernel, where it typically has to happen
- anyhow (to assemble network packets). On testing, various
- Unixes (BSDI 2.x, Solaris 2.5, Linux 2.0.31+) properly combine
- the elements into network packets. Pre-2.0.31 Linux will not
- combine, and will create a packet for each element, so
- upgrading is a good idea. Defining NO_WRITEV
will
- disable this combining, but result in very poor chunked
- encoding performance.
+ When doing HTTP/1.1 chunked encoding Apache will generate up to four
+ element writev
s. The goal is to push the byte copying into
+ the kernel, where it typically has to happen anyhow (to assemble
+ network packets). On testing, various Unixes (BSDI 2.x, Solaris 2.5,
+ Linux 2.0.31+) properly combine the elements into network packets.
+ Pre-2.0.31 Linux will not combine, and will create a packet for each
+ element, so upgrading is a good idea. Defining NO_WRITEV
+ will disable this combining, but result in very poor chunked encoding
+ performance.
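A two-element writev of the kind shown in the trace can be sketched like this. The function name send_head_and_body is hypothetical, and the sketch ignores partial writes, which a real server has to handle.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Send the headers and the first hunk of the body with one system call,
 * without first copying them into a single buffer in user space.
 * Returns what writev() returns: bytes written, or -1 on error. */
ssize_t send_head_and_body(int fd, const char *headers,
                           const void *body, size_t body_len)
{
    struct iovec iov[2];

    iov[0].iov_base = (void *)headers;  /* iov_base is not const-qualified */
    iov[0].iov_len = strlen(headers);
    iov[1].iov_base = (void *)body;
    iov[1].iov_len = body_len;

    return writev(fd, iov, 2);
}
```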
The log write:
@@ -883,13 +917,12 @@ write(17, "127.0.0.1 - - [10/Sep/1997:23:39"..., 71) = 71
- can be deferred by defining BUFFERED_LOGS
. In this
- case up to PIPE_BUF
bytes (a POSIX defined
- constant) of log entries are buffered before writing. At no
- time does it split a log entry across a PIPE_BUF
- boundary because those writes may not be atomic.
- (i.e., entries from multiple children could become
- mixed together). The code does its best to flush this buffer
+ can be deferred by defining BUFFERED_LOGS
. In this case up
+ to PIPE_BUF
bytes (a POSIX defined constant) of log
+ entries are buffered before writing. At no time does it split a log
+ entry across a PIPE_BUF
boundary because those writes may
+ not be atomic. (i.e., entries from multiple children could
+ become mixed together). The code does its best to flush this buffer
when a child dies.
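The buffering rule can be sketched as below: entries accumulate in a PIPE_BUF-sized buffer, and the buffer is flushed before any entry that would not fit, so an entry is never split across two writes. The struct log_buf type and the function names are hypothetical, not the ones used by Apache's logging code.

```c
#include <limits.h>
#include <string.h>
#include <unistd.h>

#ifndef PIPE_BUF
#define PIPE_BUF 512   /* POSIX minimum, used if the system omits it */
#endif

/* Hypothetical per-child log buffer. */
struct log_buf {
    int fd;
    size_t used;
    char data[PIPE_BUF];
};

/* Write out whatever is buffered with a single write() call. */
int log_flush(struct log_buf *lb)
{
    ssize_t n;

    if (lb->used == 0)
        return 0;
    n = write(lb->fd, lb->data, lb->used);
    lb->used = 0;
    return n < 0 ? -1 : 0;
}

/* Queue one complete entry, flushing first if it would not fit. */
int log_append(struct log_buf *lb, const char *entry, size_t len)
{
    if (len > sizeof lb->data) {
        /* Oversized entry: flush, then write it directly on its own. */
        if (log_flush(lb) < 0)
            return -1;
        return write(lb->fd, entry, len) < 0 ? -1 : 0;
    }
    if (lb->used + len > sizeof lb->data && log_flush(lb) < 0)
        return -1;
    memcpy(lb->data + lb->used, entry, len);
    lb->used += len;
    return 0;
}
```

A child using this would also call log_flush() on its way out, so that buffered entries are not lost when it exits.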
The lingering close code causes four system calls:
@@ -905,9 +938,8 @@ which were described earlier.
Let's apply some of these optimizations:
- -DSINGLE_LISTEN_UNSERIALIZED_ACCEPT
- -DBUFFERED_LOGS
and ExtendedStatus Off
.
- Here's the final trace:
-DSINGLE_LISTEN_UNSERIALIZED_ACCEPT -DBUFFERED_LOGS
and
+ ExtendedStatus Off
. Here's the final trace:
- That's 19 system calls, of which 4 remain relatively easy to - remove, but don't seem worth the effort. + That's 19 system calls, of which 4 remain relatively easy to remove, + but don't seem worth the effort. -@@ -932,91 +964,83 @@ munmap(0x400e3000, 6144) = 0
time(2)
system
- calls.time(2)
system calls.
mod_include
, these calls are used by few sites
- but required for backwards compatibility.mod_include
, these calls are used by few sites but
+ required for backwards compatibility.
Apache (on Unix) is a pre-forking model server. The - parent process is responsible only for forking - child processes, it does not serve any requests or - service any network sockets. The child processes actually - process connections, they serve multiple connections (one at a - time) before dying. The parent spawns new or kills off old - children in response to changes in the load on the server (it - does so by monitoring a scoreboard which the children keep up - to date).
- -This model for servers offers a robustness that other models - do not. In particular, the parent code is very simple, and with - a high degree of confidence the parent will continue to do its - job without error. The children are complex, and when you add - in third party code via modules, you risk segmentation faults - and other forms of corruption. Even should such a thing happen, - it only affects one connection and the server continues serving - requests. The parent quickly replaces the dead child.
+ parent process is responsible only for forking child + processes, it does not serve any requests or service any network + sockets. The child processes actually process connections, they serve + multiple connections (one at a time) before dying. The parent spawns + new or kills off old children in response to changes in the load on the + server (it does so by monitoring a scoreboard which the children keep + up to date). + +This model for servers offers a robustness that other models do not. + In particular, the parent code is very simple, and with a high degree + of confidence the parent will continue to do its job without error. The + children are complex, and when you add in third party code via modules, + you risk segmentation faults and other forms of corruption. Even should + such a thing happen, it only affects one connection and the server + continues serving requests. The parent quickly replaces the dead + child.
Pre-forking is also very portable across dialects of Unix. Historically this has been an important goal for Apache, and it continues to remain so.
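The division of labor in the pre-forking model can be sketched with a toy parent loop. This is an illustration under stated assumptions, not Apache's parent: child_main is a stand-in that exits immediately, and the total_spawns budget exists only so that the example terminates.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stand-in for a child's work loop; hypothetical. A real child would
 * accept and serve connections here before exiting. */
static void child_main(void)
{
    _exit(0);
}

static pid_t spawn_child(void)
{
    pid_t pid = fork();

    if (pid == 0)
        child_main();           /* never returns */
    return pid;
}

/* Keep up to nchildren alive until total_spawns children have been
 * started, replacing each child that exits. Returns how many children
 * were reaped, or -1 on failure. */
int prefork_run(int nchildren, int total_spawns)
{
    int spawned = 0, alive = 0, reaped = 0;

    while (spawned < total_spawns || alive > 0) {
        /* Top the pool back up, within the overall spawn budget. */
        while (alive < nchildren && spawned < total_spawns) {
            if (spawn_child() < 0)
                return -1;
            spawned++;
            alive++;
        }
        /* The parent serves nothing itself: it only waits for a child
         * to exit (or crash) so that it can start a replacement. */
        if (wait(NULL) < 0)
            return -1;
        alive--;
        reaped++;
    }
    return reaped;
}
```

Because the parent does nothing but fork and wait, a crash inside a child cannot take the parent down with it.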
-The pre-forking model comes under criticism for various
- performance aspects. Of particular concern are the overhead of
- forking a process, the overhead of context switches between
- processes, and the memory overhead of having multiple
- processes. Furthermore it does not offer as many opportunities
- for data-caching between requests (such as a pool of
- mmapped
files). Various other models exist and
- extensive analysis can be found in the papers
- of the JAWS project. In practice all of these costs vary
- drastically depending on the operating system.
Apache's core code is already multithread aware, and Apache - version 1.3 is multithreaded on NT. There have been at least - two other experimental implementations of threaded Apache, one - using the 1.3 code base on DCE, and one using a custom - user-level threads package and the 1.0 code base; neither is - publicly available. There is also an experimental port of - Apache 1.3 to Netscape's - Portable Run Time, which is - available (but you're encouraged to join the new-httpd mailing - list if you intend to use it). Part of our redesign for - version 2.0 of Apache will include abstractions of the server - model so that we can continue to support the pre-forking model, - and also support various threaded models. - +
The pre-forking model comes under criticism for various performance
+ aspects. Of particular concern are the overhead of forking a process,
+ the overhead of context switches between processes, and the memory
+ overhead of having multiple processes. Furthermore it does not offer as
+ many opportunities for data-caching between requests (such as a pool of
+ mmapped
files). Various other models exist and extensive
+ analysis can be found in the papers of
+ the JAWS project. In practice all of these costs vary drastically
+ depending on the operating system.
Apache's core code is already multithread aware, and Apache version + 1.3 is multithreaded on NT. There have been at least two other + experimental implementations of threaded Apache, one using the 1.3 code + base on DCE, and one using a custom user-level threads package and the + 1.0 code base; neither is publicly available. There is also an + experimental port of Apache 1.3 to Netscape's Portable + Run Time, which is available (but + you're encouraged to join the new-httpd mailing list + if you intend to use it). Part of our redesign for version 2.0 of + Apache will include abstractions of the server model so that we can + continue to support the pre-forking model, and also support various + threaded models.