httpd-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dean Gaudet <>
Subject Re: cvs commit: apache-1.3/src/main util_uri.c
Date Sat, 14 Mar 1998 23:24:54 GMT
Well there's some serious diminishing returns... I mean I can take the
code right now and profile it with a certain load and then shave 1% or 2%
off that (I've got a bunch of other such optimizations just lying around). 
But at some point things go beyond the easily comprehendable point... and
parse_uri_components is probably at that boundary already, just by virtue
of the amount of state that each position in the function represents.

The speedup I'm referring to though is kind of cute... suppose you're looking
for a byte c, and that cf is the carry flag (available in assembly, not in C)
then here's a cute way to check 4 bytes at a time to see if c is one of them:

    unsigned cccc;
    unsigned *s;

    /* need c duplicated into all 4 bytes */
    cccc = (c << 24) | (c << 16) | (c << 8) | c;

    for (;;) {
	unsigned u;
	unsigned t;

	/* after this xor we're really interested in knowing if
	 * one of t's bytes is 0 */
	t = *s ^ cccc;

	u = t + 0xfefefeff;
	if (cf) {
	    none of t's bytes are 0
	else {
	    at least one of t's bytes are 0

Go through the logic -- if t's lowest byte is non-zero then when 0xff is
added to it it will carry into the next byte.  If that byte is non-zero
then 0xfe + 1bitcarry + t's second lowest byte will overflow and carry
into the third lowest byte.  And so on until it carries out the top.

But if one of the bytes is 0 then it won't carry out the top.

(If it's not clear yet ignore the xor and pretend we're just implementing
strlen() looking for a 0 byte.  In this case the xor isn't required.)

I'm not sure who the originator of this trick is.  But it's a great way to
do word-size scanning of memory for particular bytes.  We'd just need a
few extra xors and adds in the inner loop.  It would take um, 6 registers
to do this for 4 byte values, which conveniently fits into the 386 register
set and is enough for parse_uri_components.

This results in 15 instructions to do each 4 bytes.  Whereas with the current 
code we require (at least) 6 instructions per byte.  wave your hands here.

Is it worth it?  Probably not from the complexity point of view.  If
you look at a profile it's all system calls in the top 5.


On Sat, 14 Mar 1998, Brian Behlendorf wrote:

> At 07:11 AM 3/13/98 -0000, wrote:
> >dgaudet     98/03/12 23:11:03
> >
> >  Modified:    src/main util_uri.c
> >  Log:
> >  Deal with the performance problem in parse_uri_components().  This new
> >  version is over two orders of magnitude faster based on timing trials
> >  requesting the test page from mod_test_util_uri.  It's 50% faster overall
> >  when doing a zb /index.html with the default index.html.
> >  
> >  I'm still resiting the urge to hardcode i386 assembly language with
> >  a C fallback for the rest of the world ;)
> Don't hold back, Dean!  :)
> As a matter of fact, a certain company may be interested in hiring a
> contractor to do stuff like this with the Apache code over a two-three week
> span; particularly focusing on Pentium II and Merced architectures.  If
> you're interested in this, let me know.
> 	Brian
> --=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=--
> specialization is for insects

View raw message