Return-Path:
Delivered-To: new-httpd-archive@hyperreal.org
Received: (qmail 17661 invoked by uid 6000); 18 Oct 1997 05:25:46 -0000
Received: (qmail 17634 invoked from network); 18 Oct 1997 05:25:43 -0000
Received: from valis.worldgate.com (marcs@198.161.84.2)
  by taz.hyperreal.org with SMTP; 18 Oct 1997 05:25:43 -0000
Received: from localhost (marcs@localhost)
  by valis.worldgate.com (8.8.7/8.8.7) with SMTP id XAA07286
  for ; Fri, 17 Oct 1997 23:25:18 -0600 (MDT)
Date: Fri, 17 Oct 1997 23:25:18 -0600 (MDT)
From: Marc Slemko
To: Apache - BYOC
Subject: on large numbers of virtual hosts and memory use
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: new-httpd-owner@apache.org
Precedence: bulk
Reply-To: new-httpd@apache.org

Below is pulled from a thread on comp.unix.solaris.

The basic issue is that each virtual host configured into Apache takes
some memory. Say 4k per host is a good number. Stronghold appears to,
for whatever reason, take a _lot_ more. If you have thousands and
thousands of virtual hosts, this memory use adds up.

It is set up in the parent, and after that the child processes shouldn't
play with it. Since the child processes don't play with it, any modern
system will not allocate new pages for it but will simply flag the
existing ones COW (copy-on-write).

So far so good; the memory isn't used, so it doesn't add to physical
memory overhead.

The trick is that many or most systems reserve swap space for pages
flagged as COW, so that they don't risk running out of swap when a
process decides it wants to actually write to these pages. Therein lies
the problem; you gobble huge amounts of swap.

The workaround: since we know the child should never be writing to these
pages, COW is simply an excuse for not doing shared memory, so actually
put the data in shared memory. Then you get rid of the private pages
mapped in each child, and are much happier.

This doesn't look to be _that_ major an undertaking to me.

Comments?

---------- Forwarded message ----------
>Path: scanner.worldgate.com!news.he.net!newsfeed.direct.ca!newsfeed.internetmci.com!207.69.200.61!mindspring!news.mindspring.com!demon.mindspring.com!news
>From: news@demon.mindspring.com (News Reader)
>Newsgroups: comp.unix.solaris
>Subject: Re: reserving swap (was: Solaris 2.6 fd limits ? 256)
>Date: 18 Oct 1997 04:34:58 GMT
>Organization: MindSpring Enterprises Inc.
>Lines: 91
>Distribution: inet
>Message-ID: <629e9i$kaf@camel12.mindspring.com>
>References: <3443B96C.6849@risq.qc.ca> <6245a4$knt$1@griffin.itc.gu.edu.au> <625j6g$hn3@camel18.mindspring.com> <627aa0$p2e$1@griffin.itc.gu.edu.au>
>NNTP-Posting-Host: aslan.mindspring.net
>Keywords: mmap(), MAP_NORESERVE, fork(), reserved swap
>Xref: scanner.worldgate.com comp.unix.solaris:120551

In article <627aa0$p2e$1@griffin.itc.gu.edu.au>,
Sean Vickery wrote:
>
>Solaris malloc() (always?) allocates heap memory using sbrk(), not mmap().

If you link against libmapmalloc, then you use versions of malloc() that
use mmap() instead of sbrk(). I discovered later from the mmap() man page
(which I've read a hundred times and overlooked this) that

"...MAP_NORESERVE mappings are inherited across fork(2); at the time of
the fork(2) swap space is reserved in the child for all private pages
that currently exist in the parent; thereafter the child's mapping
behaves as described above."

which states that swap space is reserved across a fork() anyway, even if
the memory is mmapped MAP_NORESERVE.

>I'm pretty sure that sbrk() would always reserve swap space, in parent and
>child. I'm guessing too now, but it makes sense. One wouldn't want to be
>writing into some malloc()ed pages when suddenly one gets a SIGBUS, like one
>does in the case where one's using MAP_NORESERVE mmap() pages and there's
>not enough swap when a page actually gets written to for the first time.
>If malloc did this it would be ridiculous: `Malloc() told me when it returned
>successfully that the system had enough memory to give me some; now it's
>changed its mind.'

This can only happen if you actually run out of physical swap and totally
exhaust RAM. Good system planning and no serious memory leaks can make
this unlikely. With copy on write via fork(), the system can waste
enormous amounts of resources because so much is reserved, but so little
is used.

[snipped]

>I wonder if you've really got a problem: you have to configure up heaps
>of swap space on the system, sure, but it's not actually being used or slowing
>down Apache in any way, so why worry about that?

We worry because we need to allocate massive amounts of swap, potentially
30 GB or more (combined across all servers), that will never get used.
That can cost a lot of money and cause some headaches.

>>
>> Here is a line from the Apache source code (http_main.c, about line 686):
>>
>> m = mmap((caddr_t)0, SCOREBOARD_SIZE, PROT_READ | PROT_WRITE, MAP_ANON |
>> MAP_SHARED | MAP_NORESERVE, -1, 0);
>
>This looks to be explicitly allocating a block of shared memory, which isn't
>what I thought we were talking about. MAP_ANON is non-standard, a Linuxism I
>think, and isn't defined by Solaris' include files. When Apache is being
>compiled for Solaris, does it define MAP_ANON to be zero?

I copied and pasted the wrong line. This line would not be #def'd and
would not be compiled, but it demonstrates the change anyway.

>
>Clearly Apache isn't using mmap() to allocate the large block of memory that
>you are concerned about.

No, I don't even think they thought about the problem. We only first
noticed it about 8 months ago when a system was having problems with
"could not grow stack", or "could not fork(), no space available." It
seemed to be out of memory, but had actually used very little physical
swap.

The problem only becomes significant when we have a large number of
virtual hosts in Apache's config, or we are using Stronghold. We could
shrink the daemon size down by splitting it up, but we would have to use
the Listen directive. Using that caused us to run into the stdio FILE
struct problem as well as a number of other log splitting difficulties.
The whole thing just got too ugly.

>
>If what Apache wants to do is to share some data in memory between a parent
>and children processes, perhaps it should do so explicitly: call shm_open(),
>ftruncate() and mmap(..., MAP_SHARED, ...) in the parent, then fork() and
>in the child, to insure against it writing to the memory, call mprotect(...
>PROT_READ). Would this work, or would the child have to call shm_open, etc
>itself too? That would certainly work, though would require a few extra lines
>of code. From your description, implementing things this way would appear to
>give the desired functionality, without the need to reserve large amounts of
>swap or rely on copy-on-write.

True, but that would require a good rewrite of Apache. Oracle does use
shared memory this way. We have about 6-7 Oracle daemons using 26MB of
memory each but only about 40 MB of reserved swap is "used".

The no-overcommit feature, while useful to keep machines from thrashing
when they really do run out of memory, can seriously waste resources
under some circumstances and should have an off switch.

--
mikeh AT mindspring.net
MindSpring Web Hosting Engineering

---------- Forwarded message ----------
>Path: scanner.worldgate.com!news.maxwell.syr.edu!newsfeed.internetmci.com!192.48.96.124!in4.uu.net!ozemail!news.mel.aone.net.au!newsfeed-in.aone.net.au!news.mel.connect.com.au!munnari.OZ.AU!bunyip.cc.uq.edu.au!newshost.gu.edu.au!usenet
>From: Sean Vickery
>Newsgroups: comp.unix.solaris
>Subject: Re: reserving swap (was: Solaris 2.6 fd limits ? 256)
>Date: 17 Oct 1997 09:14:40 GMT
>Organization: Griffith University, Queensland, Australia
>Lines: 64
>Distribution: inet
>Message-ID: <627aa0$p2e$1@griffin.itc.gu.edu.au>
>References: <3443B96C.6849@risq.qc.ca> <622sqa$fla@camel20.mindspring.com> <6245a4$knt$1@griffin.itc.gu.edu.au> <625j6g$hn3@camel18.mindspring.com>
>NNTP-Posting-Host: centaur.itc.gu.edu.au
>Keywords: mmap(), MAP_NORESERVE, fork(), reserved swap
>Xref: scanner.worldgate.com comp.unix.solaris:120433

On 16 Oct 1997, News Reader wrote
in comp.unix.solaris:
> The parent writes the data to malloc'd memory not mmap'd memory (I assume). It
> then forks and its children read and access that data but never change or
> write to it (so they never get their own private copies). Unless you rewrite
> malloc() to use MAP_NORESERVE in its calls to mmap() (if that is what it
> does), then I see no easy solution to this problem. But I'm still guessing.

Mike,

Solaris malloc() (always?) allocates heap memory using sbrk(), not mmap().

I'm pretty sure that sbrk() would always reserve swap space, in parent
and child. I'm guessing too now, but it makes sense. One wouldn't want to
be writing into some malloc()ed pages when suddenly one gets a SIGBUS,
like one does in the case where one's using MAP_NORESERVE mmap() pages
and there's not enough swap when a page actually gets written to for the
first time. If malloc did this it would be ridiculous: `Malloc() told me
when it returned successfully that the system had enough memory to give
me some; now it's changed its mind.'

Now, gnumalloc does use mmap() when you ask it for a large (>120k or so)
block. You could easily patch the gnumalloc source to include
MAP_NORESERVE, but then Apache would have to be aware that it may have
more memory mapped than can be backed in swap, the scenario I described
in the previous paragraph. If you have plenty of swap, you may as well
ignore this.

I wonder if you've really got a problem: you have to configure up heaps
of swap space on the system, sure, but it's not actually being used or
slowing down Apache in any way, so why worry about that?

> >Some details, code even, from your application may prove helpful.
>
> Here is a line from the Apache source code (http_main.c, about line 686):
>
> m = mmap((caddr_t)0, SCOREBOARD_SIZE, PROT_READ | PROT_WRITE, MAP_ANON |
> MAP_SHARED | MAP_NORESERVE, -1, 0);

This looks to be explicitly allocating a block of shared memory, which
isn't what I thought we were talking about. MAP_ANON is non-standard, a
Linuxism I think, and isn't defined by Solaris' include files. When
Apache is being compiled for Solaris, does it define MAP_ANON to be zero?

> I added MAP_NORESERVE to this call and every other mmap() call in the apache
> source. It made no difference.

Clearly Apache isn't using mmap() to allocate the large block of memory
that you are concerned about.

> [top, pmap and swap -s output snipped]

If what Apache wants to do is to share some data in memory between a
parent and children processes, perhaps it should do so explicitly: call
shm_open(), ftruncate() and mmap(..., MAP_SHARED, ...) in the parent,
then fork() and in the child, to insure against it writing to the memory,
call mprotect(... PROT_READ). Would this work, or would the child have to
call shm_open, etc itself too? That would certainly work, though would
require a few extra lines of code. From your description, implementing
things this way would appear to give the desired functionality, without
the need to reserve large amounts of swap or rely on copy-on-write.

Sean.
--
Sean Vickery                     Ph: +61 (0)7 3875 6410
Systems Programmer
Information Services
Griffith University
Copyright (C) 1997 All rights reserved. Remove the smeared Nordics to email.

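The sequence Sean describes (shm_open(), ftruncate(), mmap() with MAP_SHARED in the parent, then fork() and mprotect() in the child) can be sketched roughly as follows. This is an illustration under stated assumptions, not code from Apache: the function names and the object name passed in are hypothetical. It also bears on the question of whether the child must call shm_open() itself: mappings are inherited across fork(), so in this sketch the child only calls mprotect().

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Create a shared memory region of len bytes (parent side).
 * name is a hypothetical POSIX shm object name such as "/demo_region".
 * Returns NULL on failure. */
void *shared_region_create(const char *name, size_t len)
{
    int fd = shm_open(name, O_CREAT | O_EXCL | O_RDWR, 0600);
    if (fd < 0)
        return NULL;
    shm_unlink(name);               /* the mapping keeps the object alive */
    if (ftruncate(fd, (off_t)len) < 0) {
        close(fd);
        return NULL;
    }
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                      /* the descriptor is not needed after mmap */
    return p == MAP_FAILED ? NULL : p;
}

/* Fork a child that drops write permission on the inherited mapping and
 * then compares the string stored at its start with expect.
 * Returns 1 on a match, 0 otherwise, -1 if fork() or waitpid() failed. */
int child_reads_readonly(void *region, size_t len, const char *expect)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* child: the mapping came across fork(); no shm_open() here */
        if (mprotect(region, len, PROT_READ) < 0)
            _exit(2);
        _exit(strcmp((const char *)region, expect) == 0 ? 0 : 1);
    }
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

The parent would write its configuration data into the region before forking. After the mprotect() call, a stray write in the child faults instead of silently creating a private copy. On some systems shm_open() lives in a separate library and needs -lrt at link time.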
---------- Forwarded message ----------
>Path: scanner.worldgate.com!rover.ucs.ualberta.ca!news.bc.net!logbridge.uoregon.edu!newsfeed.internetmci.com!207.69.200.61!mindspring!news.mindspring.com!demon.mindspring.com!news
>From: news@demon.mindspring.com (News Reader)
>Newsgroups: comp.unix.solaris
>Subject: Re: reserving swap (was: Solaris 2.6 fd limits ? 256)
>Date: 16 Oct 1997 17:34:08 GMT
>Organization: MindSpring Enterprises Inc.
>Lines: 171
>Distribution: inet
>Message-ID: <625j6g$hn3@camel18.mindspring.com>
>References: <3443B96C.6849@risq.qc.ca> <621pte$ep2$1@griffin.itc.gu.edu.au> <622sqa$fla@camel20.mindspring.com> <6245a4$knt$1@griffin.itc.gu.edu.au>
>NNTP-Posting-Host: aslan.mindspring.net
>Keywords: mmap(), MAP_NORESERVE, fork(), reserved swap
>Xref: scanner.worldgate.com comp.unix.solaris:120339

In article <6245a4$knt$1@griffin.itc.gu.edu.au>,
Sean Vickery wrote:
>On 15 Oct 1997, News Reader wrote
>in comp.unix.solaris:
>> In article <621pte$ep2$1@griffin.itc.gu.edu.au>,
>> Sean Vickery wrote:
>> >
>> >Solaris is perfectly capable of mapping pages without reserving swap. Simply
>> >pass the MAP_NORESERVE flag to mmap(2). [snip]
>>
>> I don't think that will work in this situation. mmap(2) doesn't even come
>> into play here. The problem is that the daemons fork() to handle new
>> requests and the reserve memory count increases appropriately for that
>> child's address space. Since fork() does copy on write (correct for Solaris?)
>> and the large blocks of memory never get written to, huge amounts of
>> virtual memory (reserved swap) get used while the actual memory usage doesn't
>> increase much. I'm just guessing here so I may be out in left field. I
>> tried adding MAP_NORESERVE to mmap() calls in several programs but it made no
>> difference.
>
>Mike,
>
>The mmap() system call is often at work behind the scenes, especially whenever
>shared libraries are involved. It's one of the basic ways to have more pages
>mapped into a process's virtual address space. (Others are sbrk() and exec.)
>Certainly fork(2) does copy-on-write.
>
>I don't understand what you mean by `the large blocks of memory never get
>written to', so I'll continue to attempt to answer your question in the most
>general case. What are these large blocks of memory that never get written
>to? And wouldn't that be a bit inefficient?

The parent writes the data to malloc'd memory not mmap'd memory (I
assume). It then forks and its children read and access that data but
never change or write to it (so they never get their own private copies).
Unless you rewrite malloc() to use MAP_NORESERVE in its calls to mmap()
(if that is what it does), then I see no easy solution to this problem.
But I'm still guessing.

>Some details, code even, from your application may prove helpful.

Here is a line from the Apache source code (http_main.c, about line 686):

m = mmap((caddr_t)0, SCOREBOARD_SIZE, PROT_READ | PROT_WRITE, MAP_ANON |
MAP_SHARED | MAP_NORESERVE, -1, 0);

I added MAP_NORESERVE to this call and every other mmap() call in the
apache source. It made no difference.

>
>From reading this, it seems pretty clear to me that if one had mmap()ed
>a large chunk of memory with the MAP_NORESERVE flag, didn't write to any
>of it (thus no private pages are required to be created), then fork()ed,
>then no additional swap space would be consequently reserved for the
>large chunk.

Theoretically, yes.

>If you'd like us to have a bash at a better answer, give us some details
>about the daemon you're writing.

It's actually Apache and Stronghold (which uses the Apache source code).
Apache, with about 6 Class C's on an Ultra2, uses nearly 6MB per daemon.
Stronghold with two Class C's uses over 30MB per daemon; 1 Class C is
around 15MB.

I compiled these daemons with MAP_NORESERVE in all mmap() calls:

from top:

  PID USERNAME THR PRI NICE  SIZE   RES STATE   TIME    CPU COMMAND
 1997 nobody     1  35    0   15M 1932K sleep   0:00  0.13% httpsd

The parent's RES size is over 9 MB, but the children's is only around 2 MB.

/usr/proc/bin/pmap 1997
1997:   ./httpsd -d /var/stronghold -f conf/httpsd_6.conf
00010000    860K read/exec         dev: 162,2 ino: 2778085
000F6000     60K read/write/exec   dev: 162,2 ino: 2778085
00105000  12536K read/write/exec
00117000  12464K [ heap ]
EF580000     64K read/write/shared
EF5A0000     16K read/exec         /usr/lib/nss_files.so.1
EF5B3000      4K read/write/exec   /usr/lib/nss_files.so.1
EF5C0000     28K read/exec         /usr/lib/libw.so.1
EF5D6000      4K read/write/exec   /usr/lib/libw.so.1
EF5E0000     12K read/exec         /usr/lib/libmp.so.1
EF5F2000      4K read/write/exec   /usr/lib/libmp.so.1
EF600000    508K read/exec         /usr/lib/libc.so.1
EF68E000     32K read/write/exec   /usr/lib/libc.so.1
EF696000      8K read/write/exec
EF6A0000     12K read/exec         /usr/lib/libintl.so.1
EF6B2000      4K read/write/exec   /usr/lib/libintl.so.1
EF6C0000    388K read/exec         /usr/lib/libnsl.so.1
EF730000     36K read/write/exec   /usr/lib/libnsl.so.1
EF739000     32K read/write/exec
EF760000     28K read/exec         /usr/lib/libsocket.so.1
EF776000      8K read/write/exec   /usr/lib/libsocket.so.1
EF780000     84K read/exec         /usr/lib/libm.so.1
EF7A4000      8K read/write/exec   /usr/lib/libm.so.1
EF7B0000      4K read/exec/shared  /usr/lib/libdl.so.1
EF7C0000      4K read/write/exec
EF7D0000    104K read/exec         /usr/lib/ld.so.1
EF7F9000      8K read/write/exec   /usr/lib/ld.so.1
EFFF6000     40K read/write/exec
EFFF6000     40K [ stack ]

This is on Solaris 2.5.1 with the following in /etc/system:

set rlim_fd_cur=512
set rlim_fd_max=1024
set shmsys:ism_off = 1
set shmsys:shminfo_shmmax=8388608
set shmsys:shminfo_shmmin=1
set shmsys:shminfo_shmmni=100
set shmsys:shminfo_shmseg=10
set semsys:seminfo_semmns=200
set semsys:seminfo_semmni=70

Here is another from Apache (with 6 class C's in its conf file) and with
MAP_NORESERVE added to all mmap() calls (this is on a Solaris 2.6 box):

  PID USERNAME PRI NICE  SIZE   RES STATE   TIME   WCPU    CPU COMMAND
18813 nobody    23    0 5612K  620K sleep   0:00  0.00%  0.00% httpd.test

/usr/proc/bin/pmap 18813
18813:  ./httpd.test -f conf/httpd.conf.test
00010000    268K read/exec         dev:32,8 ino:265016
00062000     12K read/write/exec   dev:32,8 ino:265016
00065000   3972K read/write/exec     [ heap ]
EF600000     16K read/exec         /usr/lib/nss_files.so.1
EF613000      4K read/write/exec   /usr/lib/nss_files.so.1
EF620000     12K read/exec         /usr/lib/libmp.so.2
EF632000      4K read/write/exec   /usr/lib/libmp.so.2
EF640000    588K read/exec         /usr/lib/libc.so.1
EF6E2000     24K read/write/exec   /usr/lib/libc.so.1
EF6E8000      8K read/write/exec     [ anon ]
EF6F0000      4K read/write/exec     [ anon ]
EF700000    444K read/exec         /usr/lib/libnsl.so.1
EF77E000     32K read/write/exec   /usr/lib/libnsl.so.1
EF786000     24K read/write/exec     [ anon ]
EF790000      4K read/write/shared   [ anon ]
EF7A0000     32K read/exec         /usr/lib/libsocket.so.1
EF7B7000      4K read/write/exec   /usr/lib/libsocket.so.1
EF7B8000      4K read/write/exec     [ anon ]
EF7C0000      4K read/exec/shared  /usr/lib/libdl.so.1
EF7D0000    112K read/exec         /usr/lib/ld.so.1
EF7FB000      8K read/write/exec   /usr/lib/ld.so.1
EF7FD000      4K read/write/exec     [ anon ]
EFFF9000     28K read/write/exec     [ stack ]
total      5612K

With 100 of these running, "swap" space dropped from 2.1GB to around
1.5 GB, according to top 3.4 on Solaris 2.6.

vmstat
 procs     memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s1 s3 s6 --   in   sy   cs us sy id
 0 0 0   5840  5504   0   7  0  0  0 60  0  0  0  0  0   11   39   25  1  1 98

swap -l
swapfile             dev  swaplo  blocks    free
/dev/dsk/c0t1d0s1   32,9       8 1475912 1475912
/dev/dsk/c0t3d0s4   32,28      8 2511032 2511032

swap -s
total: 39328k bytes allocated + 411544k reserved = 450872k used, 1635400k available

--
mikeh AT mindspring.net
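
As a rough cross-check of the figures above, the sketch below adds up the private writable (read/write/exec) segments from the pmap listing for PID 18813 and scales the sum by the 100 processes mentioned. The function names are made up for illustration, and the comparison assumes the swap -s snapshot was taken while those 100 processes were running, which the message does not state explicitly.

```c
#include <stddef.h>

/* Sizes in KB of the read/write/exec (private, writable) segments that
 * pmap lists for PID 18813. The single read/write/shared segment and
 * all read/exec segments are left out. */
static const unsigned long private_rw_kb[] = {
    12,                                   /* data segment of httpd.test */
    3972,                                 /* heap */
    4, 4, 24, 8, 4, 32, 24, 4, 4, 8, 4,   /* library data and anon pages */
    28                                    /* stack */
};

/* Sum of the private writable segments of one process, in KB. */
unsigned long private_writable_kb(void)
{
    unsigned long sum = 0;
    for (size_t i = 0; i < sizeof private_rw_kb / sizeof private_rw_kb[0]; i++)
        sum += private_rw_kb[i];
    return sum;
}

/* Swap that would be reserved if every process reserved all of its
 * private writable pages, in KB. */
unsigned long estimated_reserve_kb(unsigned long nprocs)
{
    return nprocs * private_writable_kb();
}
```

Under those assumptions one process accounts for 4132K, and estimated_reserve_kb(100) gives 413200K, which is within about half a percent of the 411544k that swap -s reports as reserved. That is consistent with each child reserving swap for all of its copy-on-write pages, though it does not prove it.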