httpd-dev mailing list archives

From dean gaudet <>
Subject Re: [Fwd: Query About Apache]
Date Mon, 10 Apr 2000 19:07:42 GMT
> Date: Mon, 10 Apr 2000 17:41:29 +0530 (IST)
> From: Koushik Chakraborty <>
> To: coar@Apache.Org
> Subject: Query About Apache
> Hi,
>   I am a senior undergraduate majoring in Computer Science and Engineering
> at the Indian Institute of Technology, Kanpur. I have been working on a
> project on efficient buffer management while transferring data from one
> socket to another. I have also designed and implemented a system call which
> does the same. To elaborate: given two socket FDs, it transfers the data
> from one to the other without any user-level buffer copy of the kind that
> read/write requires. When used in a dummy proxy (which just contacts a
> server and passes the data on to the client), I saw considerable
> improvement over the normal method of successive read and write. The time
> spent in the kernel improved nearly 5 times.
> But when I hooked this system call into the proxy module of the Apache
> server, at the point where it transfers the body (the ap_proxy_send_fb
> function in proxy_util.c), the improvement was not significant at all. On
> average, it performs just as well as the original version.
> Can you give any insight into why this is happening, especially with large
> files, where we are actually saving the buffer copy that the normal mode of
> operation requires? Are you doing any optimization on your read/write? I
> saw that read is called many times over during the transmission (in buff.c
> under src/main).

yay, another "hey why don't modern computers behave like nice theoretical
computers should?" question!

yes, apache does optimise here.  if you look at ap_proxy_send_fb you'll
see it's using an 8k on-stack buffer, which it does an ap_bread() into
and an ap_bwrite() out of.
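
in simplified form the loop looks something like this (a sketch of the
pattern using apache 1.3's BUFF api, not the literal proxy_util.c code,
which also deals with timeouts, aborts and cache writes):

	char buf[IOBUFSIZE];	/* IOBUFSIZE is 8k in apache 1.3 */
	int n;

	/* f is the BUFF for the origin server; r->connection->client
	 * is the BUFF for the client */
	while ((n = ap_bread(f, buf, sizeof(buf))) > 0) {
	    if (ap_bwrite(r->connection->client, buf, n) != n)
	        break;
	}
	ap_bflush(r->connection->client);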

if you look at ap_bread() you'll see that if the buffer it's reading into
is large enough it bypasses the extra buffer attached to the BUFF *.

and in ap_bwrite() you'll see a LARGE_WRITE heuristic which similarly
bypasses buffering.
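
the guts of the two checks look roughly like this (again just a sketch,
with field and macro names as i remember them from buff.h/buff.c -- the
real code also deals with chunking, partial buffers and writev
coalescing):

	int ap_bread_sketch(BUFF *fb, void *buf, int nbyte)
	{
	    if (fb->incnt == 0 && nbyte >= fb->bufsiz) {
	        /* internal buffer is empty and the caller's buffer is
	         * at least as big as ours: read straight into the
	         * caller's buffer and skip the extra copy entirely */
	        return read(fb->fd_in, buf, nbyte);
	    }
	    /* otherwise refill the internal buffer and memcpy from it
	     * (omitted) */
	    return 0;
	}

	int ap_bwrite_sketch(BUFF *fb, const void *buf, int nbyte)
	{
	    if (nbyte >= LARGE_WRITE_THRESHOLD) {
	        /* large write: flush whatever is already buffered,
	         * then send the caller's buffer directly (the real
	         * code uses writev to combine the two) */
	        ap_bflush(fb);
	        return write(fb->fd, buf, nbyte);
	    }
	    /* otherwise memcpy into the internal buffer (omitted) */
	    return 0;
	}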

so let's analyse the path of each byte:

i'll assume you're using a 500Mhz pentium-iii, with 100Mhz memory, a
standard 100baseT NIC, and either linux or freebsd.  modifications for
other environments should be straightforward.  all that really matters
is that you've got a modern superscalar cpu and a reasonably intelligent
operating system.

i'll further assume that the cpu caches are all primed because you're
running the test more than once for the measurement; that you're not
using an SMP box (which can cause extra memory transactions if processes
migrate between cpus); and that you've got a limited number of other
processes going at the same time, so we can assume the working set fits
in the 512kB L2 cache of the cpu.

and finally, to make the analysis even simpler i'll assume every read/write
misses the L1 cache.

apache's code:

- NIC DMA incoming packet -> kernel memory
  - runs at PCI bus speed -- 33Mhz/32-bit

- read(), kernel reads bytes from packet and writes them to user buffer
  - read happens at memory bus speed 100Mhz/64-bit because it couldn't
    possibly be cached
  - write happens at L2 cache speed which on a pentium-iii is same as
    cpu speed -- 500Mhz/64-bit
    (why does this happen at L2 cache speed?  because the cache is
    primed -- the entire buffer is already in the L2 cache, so when
    we overwrite it there's no need for the cache to load from memory)

- write(), kernel reads data from user buffer and forms it into packets
  - read happens at cache speed -- 500Mhz/64-bit
  - write into packet happens at memory speed -- 100Mhz/64-bit
    (this is because the kernel has to flush the packet into RAM so
    that the DMA can occur)

- NIC DMAs outgoing packet from kernel memory
  - runs at PCI bus speed -- 33Mhz/32-bit

ok remember that one transaction at 33Mhz takes 1/33Mhz   = 30ns
                 one transaction at 100Mhz takes 1/100Mhz = 10ns
                 one transaction at 500Mhz takes 1/500Mhz =  2ns

let's add things up.  each 64 bits of data requires (the pci bus is only
32 bits wide, so each of the two DMAs needs two transfers per 64 bits):

	4 transfers @ 33Mhz  = 120 ns
	2 transfers @100Mhz  =  20 ns
	2 transfers @500Mhz  =   4 ns

and with your code, which of those transfers do you eliminate?  just the
two transfers @500Mhz -- about 4ns out of 144ns.

not much eh?
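
if you want to play with the arithmetic, here's a throwaway program that
reproduces the numbers above, plus the 66Mhz-pci and cache-busting cases
discussed below (the bus widths and clock rates are just the assumptions
i stated earlier, nothing measured):

	#include <stdio.h>

	/* ns for one bus transaction at the given clock rate */
	static double ns(double mhz) { return 1000.0 / mhz; }

	int main(void)
	{
	    /* today: 33Mhz/32-bit pci (2 transfers per 64 bits, times
	     * two DMAs), 100Mhz/64-bit memory, 500Mhz/64-bit L2 */
	    double pci = 4 * ns(33), mem = 2 * ns(100), l2 = 2 * ns(500);
	    double total = pci + mem + l2;
	    printf("33Mhz pci:    total %.0fns, user copy %.0fns (%.1f%%)\n",
	           total, l2, 100 * l2 / total);

	    /* working set blown out of the L2: the user-buffer copy runs
	     * at memory speed too, roughly 8 transfers @100Mhz, of which
	     * the copy is only 2 */
	    total = 4 * ns(33) + 8 * ns(100);
	    printf("cache-busted: total %.0fns, user copy %.1f%%\n",
	           total, 100 * 2 * ns(100) / total);

	    /* tomorrow: 66Mhz/64-bit pci -- one transfer per DMA */
	    total = 2 * ns(66) + mem + l2;
	    printf("66Mhz pci:    total %.0fns, user copy %.1f%%\n",
	           total, 100 * l2 / total);
	    return 0;
	}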

the actual picture is more complicated, and i'm obviously skimping on
some of the details... but i think the answer should be pretty obvious --
as long as there are such huge disparities between cpu cache speeds,
memory speeds, and i/o bus speeds, and as long as you keep your working
set within the size of the L2 cache, an extra copy here and there
doesn't hurt.

you could try doing your experiments such that your working set exceeds
the size of the L2 cache; you can do this by jacking up the number of
clients.  but even so, the worst case then has 4 transfers @ 33Mhz and 8
transfers @ 100Mhz, and i think you eliminate only two of the 100Mhz
transfers.

obviously a 66Mhz/64-bit PCI bus would change the picture dramatically --
then we'd have:

	2 transfers @ 66Mhz  = 30 ns
	2 transfers @100Mhz  = 20 ns
	2 transfers @500Mhz  =  4 ns

but by the time 66Mhz PCI is ubiquitous we'll probably be using merced
chips running at 1Ghz with something like 2Mb L2s @ 1Ghz, and with 200Mhz
RAMBUS ram... and the numbers still work out pretty much in favour of the
extra copy.

oh, i should note that i think certain sparc architectures, and probably
others such as HP's, allow I/O directly into the L2 cache, and probably
run I/O busses at faster rates anyhow.  but i doubt the PCI-based sparcs
do this, and those are becoming the cheapest/most common these days :)

if you want to find out your bus speeds and cache sizes i strongly
recommend the lmbench tool <>.
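
(and if you just want a quick sanity check before installing lmbench, a
crude copy-bandwidth probe is a few lines of C -- nowhere near as careful
as lmbench's memory benchmarks, and it ignores a pile of cache, compiler
and timer effects:)

	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <time.h>

	int main(void)
	{
	    size_t size = 8 * 1024 * 1024;	/* well past a 512kB L2 */
	    int i, iters = 50;
	    char *src = malloc(size), *dst = malloc(size);
	    clock_t t0;
	    double secs;

	    if (!src || !dst)
	        return 1;
	    memset(src, 1, size);
	    memset(dst, 0, size);

	    t0 = clock();
	    for (i = 0; i < iters; i++)
	        memcpy(dst, src, size);
	    secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

	    printf("copy bandwidth: about %.0f MB/s\n",
	           (double)size * iters / (1024.0 * 1024.0) / secs);
	    free(src);
	    free(dst);
	    return 0;
	}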

