Return-Path: Delivered-To: apmail-httpd-dev-archive@httpd.apache.org Received: (qmail 71111 invoked by uid 500); 5 Sep 2001 03:03:24 -0000 Mailing-List: contact dev-help@httpd.apache.org; run by ezmlm Precedence: bulk Reply-To: dev@httpd.apache.org list-help: list-unsubscribe: list-post: Delivered-To: mailing list dev@httpd.apache.org Received: (qmail 71100 invoked from network); 5 Sep 2001 03:03:24 -0000 Date: Tue, 04 Sep 2001 20:00:35 -0700 From: Brian Pane Subject: remaining CPU bottlenecks in 2.0 To: dev@httpd.apache.org Message-id: <3B959553.3040303@pacbell.net> MIME-version: 1.0 Content-type: text/plain; charset=us-ascii; format=flowed Content-transfer-encoding: 7BIT X-Accept-Language: en-us User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.3) Gecko/20010801 X-Spam-Rating: daedalus.apache.org 1.6.2 0/1000/N I'm currently studying profiling data from an httpd built from a CVS snapshot earlier today. In general, the performance of 2.0 is starting to look good. Here are the top 30 functions, ranked according to their CPU utilization. (This summary is derived from a Quantify profile of a threaded-MPM httpd handling mod_include requests on Solaris; thanks to Blaise Tarr at CNET for the profile data.) CPU time function (% of total) -------- ------------ 1. find_start_sequence 23.9 2. mmap 14.1 3. munmap 10.2 4. _writev 9.4 5. _so_accept 4.7 6. _lwp_mutex_unlock 4.3 7. stat 3.2 8. _close 2.4 9. _write 2.4 10. __door_return 2.3 cumulative total: 76.9% 11. __open 2.0 12. _so_getsockname 1.6 13. find_entry 1.5 14. strlen 1.2 15. strcasecmp 1.2 16. _so_shutdown 1.2 17. _read 0.84 18. fstat64 0.82 19. __lwp_sema_post 0.80 20. memset 0.78 cumulative total: 88.8% 21. apr_hash_next 0.53 22. apr_file_read 0.46 23. _lrw_unlock 0.39 24. __fork1 0.39 25. apr_palloc 0.39 26. strncpy 0.28 27. memchr 0.26 28. _ltzset_u 0.24 29. overlay_extension_mappings 0.20 30. apr_hash_set 0.20 cumulative total: 92.18% Some notes on the data: * The discussion here covers only CPU utilization. There are other aspects of performance, like multiprocessor scalability, that are independent of this data. * 7 of the top 10 functions are syscalls that I'd expect to see in the top 10 for an httpd. :-) * find_start_sequence() is the main scanning function within mod_include. There's some research in progress to try to speed this up significantly. * With the exception of find_start_sequence(), all the opportunities for optimization are individually small. I expect that the only way to get an additional 5% reduction in CPU usage, for example, will be to fix ten things that each account for 0.5%. * _lwp_mutex_unlock() gets from pthread_mutex_unlock(), but only from a small fraction of pthread_mutex_unlock calls (Can someone familiar with Solaris threading internals explain this one?) * Collectively, stat and open comprise 5% of the total CPU time. It would be faster to do open+fstat rather than stat+open (as long as the server is delivering mostly 200s rather than 304s), but that might be too radical a change. Anybody have thoughts on this? * I think the __door_return() call is an anomaly of the profiling instrumentation itself. * find_entry() is part of the apr_hash_t implementation. Most of the calls to find_entry()and apr_hash_next() are coming from the mod_mime dir-merge function. * strlen() is called from lots of places throughout the code, with the most frequent calls being from apr_pstrdup, apr_pstrcat, and time-formatting functions used in apr_rfc822_date. * strcasecmp() is used heavily by the apr_table_t implementation (sort_overlap, apr_table_get, apr_table_setn). Converting tables to hashes is is a potential speedup. However, most of the table API calls are on request_rec->headers_(in|out); changing those would impact a *lot* of code. * fstat64() happens during fcntl() on sockets (it appears to be a Solaris implementation detail) * memset() is called mostly from apr_pcalloc(), which in turn is used in too many places to yield any easy optimization opportunities. * memchr() is called from apr_pstrndup(), used mostly in apr_uri_parse() and mod_mime's analyze_ct(). It may be possible to optimize away some of the calls in the latter. --Brian