Mailing-List: contact dev-help@httpd.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@httpd.apache.org
Date: Tue, 04 Sep 2001 20:00:35 -0700
From: Brian Pane <bpane@pacbell.net>
Subject: remaining CPU bottlenecks in 2.0
To: dev@httpd.apache.org
Message-id: <3B959553.3040303@pacbell.net>
MIME-version: 1.0
Content-type: text/plain; charset=us-ascii; format=flowed
Content-transfer-encoding: 7BIT
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.3) Gecko/20010801

I'm currently studying profiling data from an httpd built from
a CVS snapshot earlier today.

In general, the performance of 2.0 is starting to look good.

Here are the top 30 functions, ranked according to their CPU utilization.
(This summary is derived from a Quantify profile of a threaded-MPM httpd
handling mod_include requests on Solaris; thanks to Blaise Tarr at CNET
for the profile data.)

                         CPU time
   function             (% of total)
   --------             ------------
 1. find_start_sequence    23.9
 2. mmap                   14.1
 3. munmap                 10.2
 4. _writev                 9.4
 5. _so_accept              4.7
 6. _lwp_mutex_unlock       4.3
 7. stat                    3.2
 8. _close                  2.4
 9. _write                  2.4
10. __door_return           2.3
                                                          cumulative 
total: 76.9%
11. __open                  2.0
12. _so_getsockname         1.6
13. find_entry              1.5
14. strlen                  1.2
15. strcasecmp              1.2
16. _so_shutdown            1.2
17. _read                   0.84
18. fstat64                 0.82
19. __lwp_sema_post         0.80
20. memset                  0.78
                                 cumulative total: 88.8%
21. apr_hash_next           0.53
22. apr_file_read           0.46
23. _lrw_unlock             0.39
24. __fork1                 0.39
25. apr_palloc              0.39
26. strncpy                 0.28
27. memchr                  0.26
28. _ltzset_u               0.24
29. overlay_extension_mappings    0.20
30. apr_hash_set            0.20
                                 cumulative total: 92.18%

Some notes on the data:

* The discussion here covers only CPU utilization.  There are other
  aspects of performance, like multiprocessor scalability, that
  are independent of this data.

* 7 of the top 10 functions are syscalls that I'd expect to see in
  the top 10 for an httpd. :-)

* find_start_sequence() is the main scanning function within
  mod_include.  There's some research in progress to try to speed
  this up significantly.

* With the exception of find_start_sequence(), all the opportunities
  for optimization are individually small.  I expect that the only
  way to get an additional 5% reduction in CPU usage, for example,
  will be to fix ten things that each account for 0.5%.

* _lwp_mutex_unlock() gets from pthread_mutex_unlock(),
  but only from a small fraction of pthread_mutex_unlock calls
  (Can someone familiar with Solaris threading internals explain
  this one?)

* Collectively, stat and open comprise 5% of the total CPU time.
  It would be faster to do open+fstat rather than stat+open (as
  long as the server is delivering mostly 200s rather than 304s),
  but that might be too radical a change.  Anybody have thoughts
  on this?

* I think the __door_return() call is an anomaly of the profiling
  instrumentation itself.

* find_entry() is part of the apr_hash_t implementation.  Most of
  the calls to find_entry()and apr_hash_next() are coming from the
  mod_mime dir-merge function.

* strlen() is called from lots of places throughout the code, with
  the most frequent calls being from apr_pstrdup, apr_pstrcat, and
  time-formatting functions used in apr_rfc822_date.

* strcasecmp() is used heavily by the apr_table_t implementation
  (sort_overlap, apr_table_get, apr_table_setn).  Converting tables
  to hashes is is a potential speedup.  However, most of the table
  API calls are on request_rec->headers_(in|out); changing those
  would impact a *lot* of code.

* fstat64() happens during fcntl() on sockets (it appears to be a
  Solaris implementation detail)

* memset() is called mostly from apr_pcalloc(), which in turn is
  used in too many places to yield any easy optimization opportunities.

* memchr() is called from apr_pstrndup(), used mostly in apr_uri_parse()
  and mod_mime's analyze_ct().  It may be possible to optimize away
  some of the calls in the latter.

--Brian