Date: Mon, 10 Apr 2000 12:07:42 -0700 (PDT)
From: dean gaudet
To: Koushik Chakraborty
cc: new-httpd@apache.org
Subject: Re: [Fwd: Query About Apache]

> Date: Mon, 10 Apr 2000 17:41:29 +0530 (IST)
> From: Koushik Chakraborty
> To: coar@Apache.Org
> Subject: Query About Apache
>
> Hi,
> I am a senior undergraduate majoring in Computer Science and Engg at the
> Indian Institute of Technology, Kanpur. I have been working on a project
> on efficient buffer management while transferring data from one socket to
> another. I have also designed and implemented a system call which does the
> same: given two socket FDs, it transfers the data from one to the other
> without the user-level buffer copy that read/write requires.
>
> When used in a dummy proxy (which just contacts a server and passes the
> data on to the client), I saw a considerable improvement over the normal
> method of successive read and write. The time spent in the kernel showed
> nearly a 5x improvement.
>
> I hooked this system call into the proxy module of the Apache server at
> the point where it transfers the body (the ap_proxy_send_fb function in
> proxy_util.c), but the improvement is not significant at all. On average,
> it performs just as well as the original version.
>
> Can you give any insight into why this is happening, especially with
> large file sizes where we are actually saving the buffer copy required in
> the normal mode of operation? Are you doing any optimization on your
> read/write? I saw that read is called many times over during the
> transmission (from buff.c in src/main).

yay, another "hey why don't modern computers behave like nice theoretical
computers should?" question!

yes, apache does optimise here. if you look at ap_proxy_send_fb you'll see
it uses an 8k on-stack buffer, which it does an ap_bread() into and an
ap_bwrite() out of. if you look at ap_bread() you'll see that if the buffer
it's reading into is large enough, it bypasses the extra buffer attached to
the BUFF *. and in ap_bwrite() you'll see a LARGE_WRITE heuristic which
similarly bypasses buffering.
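in other words, the hot path is effectively just a plain copy loop. here's a
rough sketch of that shape (this is NOT the real ap_proxy_send_fb -- no
timeouts, no aborted-connection handling, and plain read/write standing in
for ap_bread/ap_bwrite, which for an 8k buffer bypass the BUFF buffering
anyway; copy_body is just a name i made up):

#include <unistd.h>

/* sketch of the copy-loop pattern: read into an 8k user buffer, write it
 * back out, until EOF or error.  returns bytes copied, or -1 on error. */
static long copy_body(int from_fd, int to_fd)
{
    char buf[8192];                    /* 8k on-stack buffer */
    long total = 0;
    ssize_t n;

    while ((n = read(from_fd, buf, sizeof buf)) > 0) {
        ssize_t off = 0;
        while (off < n) {              /* cope with short writes */
            ssize_t w = write(to_fd, buf + off, (size_t)(n - off));
            if (w < 0)
                return -1;
            off += w;
        }
        total += n;
    }
    return n < 0 ? -1 : total;
}

your system call replaces the user buffer with an in-kernel transfer; the
question is how much that buffer actually costs.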
so let's analyse the path of each byte. i'll assume you're using a 500MHz
pentium-iii with 100MHz memory, a standard 100baseT NIC, and either linux
or freebsd. modifications for other environments should be straightforward;
all that really matters is that you've got a modern superscalar cpu and a
reasonably intelligent operating system.

i'll further assume that the cpu caches are all primed because you're
running the test more than once for the measurement; that you're not using
an SMP box (which can cause more memory transactions if processes migrate);
and that you've got a limited number of other processes going at the same
time, so that we can assume the working set fits in the 512KB L2 cache of
the CPU. and finally, to make the analysis even simpler, i'll assume every
read/write misses the L1 cache.

apache's code:

- NIC DMAs the incoming packet -> kernel memory
  - runs at PCI bus speed -- 33MHz/32-bit

- read(): the kernel reads bytes from the packet and writes them to the
  user buffer
  - the read happens at memory bus speed -- 100MHz/64-bit -- because it
    couldn't possibly be cached
  - the write happens at L2 cache speed, which on a pentium-iii is the same
    as the cpu speed -- 500MHz/64-bit
    (why at L2 cache speed? because the cache is primed -- the entire
    buffer is already in the L2 cache, so when we overwrite it there's no
    need for the cache to load from memory)

- write(): the kernel reads data from the user buffer and forms it into
  packets
  - the read happens at cache speed -- 500MHz/64-bit
  - the write into the packet happens at memory speed -- 100MHz/64-bit
    (because the kernel has to flush the packet into RAM so that the DMA
    can occur)

- NIC DMAs the outgoing packet from kernel memory
  - runs at PCI bus speed -- 33MHz/32-bit

ok, remember that
  one transaction at  33MHz takes 1/33MHz  ~= 30ns
  one transaction at 100MHz takes 1/100MHz  = 10ns
  one transaction at 500MHz takes 1/500MHz  =  2ns

let's add things up. each 64 bits of data requires:

  4 transfers @ 33MHz  = 120 ns
  2 transfers @100MHz  =  20 ns
  2 transfers @500MHz  =   4 ns
                         ------
                         144 ns

and with your code, which of those transfers do you eliminate? you
eliminate the two transfers @500MHz. not much, eh?

the actual picture is more complicated, and i'm obviously skimping on some
of the details... but i think the answer should be pretty obvious -- as
long as there are such huge disparities between cpu cache speeds, memory
speeds, and i/o bus speeds, and as long as you keep your working set within
the size of the L2 cache, an extra copy here and there doesn't hurt much.

you could try doing your experiments such that your working set exceeds the
size of the L2 cache; you can do this by jacking up the number of clients.
but even then, the worst case has 4 transfers @ 33MHz and 8 transfers
@ 100MHz, and i think you eliminate only two of the 100MHz transfers.

obviously a 66MHz/64-bit PCI bus would change the picture dramatically --
then we'd have:

  2 transfers @ 66MHz  = 30 ns
  2 transfers @100MHz  = 20 ns
  2 transfers @500MHz  =  4 ns

but by the time 66MHz PCI is ubiquitous we'll probably be using merced
chips running at 1GHz with something like 2MB L2s @ 1GHz, and 200MHz RAMBUS
ram... and the numbers still work out pretty much in favour of the extra
copy.

oh, i should note that i think certain sparc architectures, and probably
others such as HP, allow I/O directly into the L2 cache, and probably run
their I/O busses at faster rates anyhow. but i doubt the PCI-bus-based
sparcs do this, and those are becoming the cheapest/most common these
days :)

if you want to find out your bus speeds and cache sizes i strongly suggest
the lmbench tool; there's also a quick arithmetic sketch in the p.s. below.

-dean
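p.s. if you want to redo the arithmetic for your own machine, here's a
throwaway sketch. the clock rates and transfer counts are just the
assumptions from above -- substitute numbers from lmbench for your box. it
uses the exact 33MHz cycle time, so it prints ~145ns rather than the
rounded 144ns:

#include <stdio.h>

int main(void)
{
    /* assumed clock rates (MHz) from the analysis above */
    double pci_mhz = 33.0, mem_mhz = 100.0, cache_mhz = 500.0;

    /* transfers needed to move 64 bits through the proxy:
     *   4 x 32-bit PCI transfers (DMA in + DMA out)
     *   2 memory-speed transfers (read the incoming packet, write the
     *     outgoing packet)
     *   2 cache-speed transfers (write + read of the user buffer) */
    double pci_ns   = 4 * (1000.0 / pci_mhz);   /* one cycle = 1000/MHz ns */
    double mem_ns   = 2 * (1000.0 / mem_mhz);
    double cache_ns = 2 * (1000.0 / cache_mhz);

    printf("pci:    %6.1f ns\n", pci_ns);
    printf("memory: %6.1f ns\n", mem_ns);
    printf("cache:  %6.1f ns\n", cache_ns);
    printf("total:  %6.1f ns per 64 bits (%.1f ns without the user copy)\n",
           pci_ns + mem_ns + cache_ns, pci_ns + mem_ns);
    return 0;
}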