httpd-dev mailing list archives

From: Phillip Susi
Subject: Re: SSL downloads faster than non SSL?
Date: Wed, 03 Aug 2005 15:48:18 GMT
William A. Rowe, Jr. wrote:

>In the APR library, yes, we translate 'apr_sendfile' to TransmitFile()
>on win32.  Some other magic occurs to obtain a file handle which can 
>be passed to TransmitFile.  But there are enough flaws in the TF() api
>that perhaps this would be better defaulted to 'off'.
Really?  Are you quite sure?  I wonder what's hosing it all up.  Once 
you hand TransmitFile() the socket and file handles, it should blast the 
file over the network nice and fast. 

>That is also available.  As you are aware TransmitFile() lives entirely
>in the kernel, so there are far fewer user<->kernel mode transitions.

Yes, there are fewer user-kernel transitions, but not that many and they 
are relatively inexpensive.  By far the largest savings that 
TransmitFile() gains is from not having to copy the data from user to 
kernel buffers before it can be sent over the network.  A conventional 
read() and send() call pair ends up making a copy from kernel FS buffer 
memory to user buffer, then back to kernel socket buffer memory.  That's 
where most of the CPU time is wasted.  A few years ago I wrote an FTP
server and tried both TransmitFile() and overlapped IO.  By disabling
kernel buffering on the socket, memory mapping two 32 KB views of the
file at once, and overlapping both sends, I was able to match both the
network throughput and the low CPU load of TransmitFile().
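
To make the two data paths concrete, here is a minimal sketch in Python
rather than Win32 C.  It is not the TransmitFile() call itself:
socket.sendfile() is used as a portable stand-in that hands the file and
socket to the kernel where os.sendfile is available and otherwise falls
back to plain send().  The helper names (drain, copy_loop, kernel_send,
transfer, demo) and the local socket pair are illustrative assumptions.

```python
import os
import socket
import tempfile
import threading


def drain(sock, out):
    """Receive until the peer closes, collecting everything into `out`."""
    while True:
        chunk = sock.recv(65536)
        if not chunk:
            break
        out.extend(chunk)


def copy_loop(path, sock, bufsize=32 * 1024):
    """Conventional path: read() into a user buffer, then send() it."""
    total = 0
    with open(path, "rb") as f:
        while True:
            buf = f.read(bufsize)   # kernel FS cache -> user buffer
            if not buf:
                break
            sock.sendall(buf)       # user buffer -> kernel socket buffer
            total += len(buf)
    return total


def kernel_send(path, sock):
    """Hand the file object to socket.sendfile(); with os.sendfile
    available, the data never passes through a user-space buffer."""
    with open(path, "rb") as f:
        return sock.sendfile(f)


def transfer(sender, path):
    """Run `sender` over a local socket pair; return (sent, received)."""
    a, b = socket.socketpair()
    received = bytearray()
    reader = threading.Thread(target=drain, args=(b, received))
    reader.start()
    try:
        sent = sender(path, a)
    finally:
        a.close()        # EOF lets the reader thread finish
        reader.join()
        b.close()
    return sent, bytes(received)


def demo(size=256 * 1024):
    """Send the same temporary file both ways and return the results."""
    payload = os.urandom(size)
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
    try:
        return (payload,
                transfer(copy_loop, tmp.name),
                transfer(kernel_send, tmp.name))
    finally:
        os.unlink(tmp.name)
```

Both functions deliver identical bytes; the difference is only in how
many times the data is copied on the way out, which a local socket pair
does not measure.  Timing it against a real NIC is left to the reader.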

Specifically, I developed this on a PII-233 system with two fast 
ethernet NICs installed.  Using several other FTP servers popular at the 
time, I was only able to manage around 5,500 KB/s through one NIC using
100% of the CPU.  Using either TransmitFile() or zero-copy overlapped
IO, I was able to push 11,820 KB/s over one NIC and 8,500 KB/s over the
other (apparently not as good a card) simultaneously, using 1% of
the CPU.  There was no noticeable difference between TransmitFile() and 
the overlapped IO.  Oh, and also I had to find a registry setting to 
make TransmitFile() behave on my NT 4 workstation system the way it does 
by default on NT server to get it to perform well.  By default on 
workstation it was not nearly so good. 
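
For scale, Fast Ethernet signals at a nominal 100 Mbit/s, so the figures
above can be set against the wire rate.  The message does not say
whether KB means 1000 or 1024 bytes; this sketch assumes 1024 and takes
the other reading as a parameter.  The function name is made up for
illustration, and protocol overhead is ignored.

```python
FAST_ETHERNET_BITS_PER_S = 100_000_000  # nominal Fast Ethernet rate


def line_rate_fraction(kb_per_s, bytes_per_kb=1024):
    """Fraction of the nominal Fast Ethernet rate used by a throughput
    given in KB/s (payload only, no framing or TCP/IP overhead)."""
    return kb_per_s * bytes_per_kb * 8 / FAST_ETHERNET_BITS_PER_S


for label, rate in [("other servers", 5500),
                    ("first NIC", 11820),
                    ("second NIC", 8500)]:
    print(f"{label}: {line_rate_fraction(rate):.1%} (KB = 1024 bytes), "
          f"{line_rate_fraction(rate, 1000):.1%} (KB = 1000 bytes)")
```

Under either reading, 11,820 KB/s is roughly 95 to 97 percent of the
nominal rate, which is about as much as one Fast Ethernet link can carry
once framing overhead is counted.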

>But if you turn off sendfile, and leave mmap on, Win32 (on Apache 2.0,
>but not back in Apache 1.3) does use memory mapped I/O.
>You suggest this works with SSL to create zero-copy?  That's not quite
>correct, since there is the entire translation phase required.

My understanding is that the current code will memory map the data file, 
optionally encrypt it with SSL, and then call a conventional send().  
Using send() on a memory mapped file view instead of read() eliminates 
one copy, but there is still another one made when you call send(), so 
you're only halfway there.  To eliminate that second copy you have to
ask the kernel to set the socket buffer size to 0 (I can't remember if
that was done with setsockopt or ioctlsocket) and then use overlapped
IO (preferably with IO completion ports for notification) to give the
kernel multiple pending buffers to send.  That way you eliminate the 
second buffer copy and the NIC always has a locked buffer from which it 
can DMA. 
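
A rough sketch of the "half way there" case, again in Python rather than
Win32 C: the file is memory mapped and slices of the mapping are handed
straight to send(), so there is no read() copy, but send() still copies
into the socket buffer.  SO_SNDBUF is a socket-level option set through
setsockopt; what the OS does with a request for 0 varies by platform
(some clamp it to a minimum), so the helper just reports what was
granted.  The overlapped-IO step is not shown.  All function names are
illustrative, and the 32 KB view size is taken from the earlier
paragraph.

```python
import mmap
import os
import socket
import tempfile
import threading

VIEW = 32 * 1024  # view size mentioned in the message


def shrink_send_buffer(sock, size=0):
    """Request a smaller SO_SNDBUF; return the size the OS granted."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, size)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)


def send_mapped(path, sock, view=VIEW):
    """Send a non-empty file from a memory mapping, one slice at a time.

    Each slice is a memoryview over the mapping, so nothing is copied
    into a separate user buffer first.  send() still makes its own copy
    into the socket buffer.
    """
    total = 0
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm, \
            memoryview(mm) as mv:
        for off in range(0, len(mv), view):
            with mv[off:off + view] as part:
                sock.sendall(part)
                total += len(part)
    return total


def demo(size=5 * VIEW + 123):
    """Send a temporary file over a local socket pair via the mapping."""
    payload = os.urandom(size)
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
    a, b = socket.socketpair()
    received = bytearray()

    def drain():
        while True:
            chunk = b.recv(65536)
            if not chunk:
                break
            received.extend(chunk)

    reader = threading.Thread(target=drain)
    reader.start()
    try:
        granted = shrink_send_buffer(a)
        sent = send_mapped(tmp.name, a)
    finally:
        a.close()
        reader.join()
        b.close()
        os.unlink(tmp.name)
    return payload, sent, bytes(received), granted
```

The remaining copy is the one the paragraph above proposes to remove by
keeping several buffers pending with overlapped IO, which has no direct
equivalent in this blocking sketch.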

>:)  We seriously appreciate all efforts.  If you are very familiar with
>Win32 internals, the mpm_winnt.c does need work; I hope to change this
>mpm to follow unix in setting up/tearing down threads when we hit min
>and max thresholds.  Obviously many other things in the (fairly simple)
>win32 implementation can be improved.  Support for multiple processes
>is high on the list, since a fault in a single thread brings down the
>process and many established connections, and introduces a large latency 
>until the next worker process is respawned and accepting connections.

Well, ideally you just need a small number of worker threads using an IO 
completion port.  This yields much better results than allocating one 
thread to each request, even if those threads are created in advance.  I 
have been trying to gain some understanding of the Apache 2 bucket 
brigade system, but it's been a bit difficult just perusing the docs on 
the web site in my spare time.  From what I've been able to pick up so 
far though, it looks like the various processing stages have the option 
to either hold onto a bucket to process asynchronously, process the 
bucket synchronously, or simply pass it down to the next layer 
immediately.  What I have not been able to figure out is if any of the 
processing layers tend to make system calls to block the thread.  
Provided that you don't do very much to block the thread while 
processing a request, then if you were to use an IO completion port 
model, a small handful of threads could service potentially thousands of 
requests at once.
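
The "few threads, many requests" idea can be sketched with Python's
selectors module.  This is an assumption-laden analogue, not IOCP: an IO
completion port reports operations that have finished, while a selector
reports sockets that are ready, but in both cases no thread is parked on
any single connection.  The names serve and demo, the upper-casing
handler, and the use of local socket pairs are all made up for the
example.

```python
import selectors
import socket


def serve(connections, handler):
    """Service every connection from the calling thread.

    The thread only waits inside select(), never on one particular
    socket.  Returns the number of request bytes handled per connection.
    Replies are assumed small enough to fit in the socket buffer.
    """
    sel = selectors.DefaultSelector()
    handled = {}
    for idx, conn in enumerate(connections):
        conn.setblocking(False)
        sel.register(conn, selectors.EVENT_READ, idx)
        handled[idx] = 0
    while sel.get_map():                 # until every peer has closed
        for key, _ in sel.select():
            conn, idx = key.fileobj, key.data
            data = conn.recv(4096)
            if data:
                conn.sendall(handler(data))
                handled[idx] += len(data)
            else:                        # peer finished sending
                sel.unregister(conn)
                conn.close()
    sel.close()
    return handled


def demo(n=100):
    """Queue one request on each of n connections, then serve them all
    from a single thread and collect the replies."""
    pairs = [socket.socketpair() for _ in range(n)]
    for i, (client, _) in enumerate(pairs):
        client.sendall(b"request %d" % i)
        client.shutdown(socket.SHUT_WR)  # signal end of request
    handled = serve([server for _, server in pairs], bytes.upper)
    replies = []
    for client, _ in pairs:
        replies.append(client.recv(4096))
        client.close()
    return handled, replies
```

The caveat from the paragraph above applies here too: if the handler
makes a blocking call, every other connection waits behind it.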

Also, a fault in one thread does not have to kill the entire process; you
can catch the fault and handle it more gracefully.  I'd love to dig into
mpm_winnt but at the moment my plate is a bit full.  Maybe in another 
month or two I'll be able to take a week off from work and dig into it. 

Of course, if someone else who is already familiar with the code wants 
to work on it, I'd be quite happy to consult ;)
