Received: by taz.hyperreal.com (8.7.5/V2.0) id MAA05418;
    Wed, 21 Aug 1996 12:29:28 -0700 (PDT)
Received: from battra.telebase.com by taz.hyperreal.com (8.7.5/V2.0)
    with ESMTP id MAA05411; Wed, 21 Aug 1996 12:29:25 -0700 (PDT)
Received: from wormhole.telebase.com by battra.telebase.com
    id PAA06954 for ; Wed, 21 Aug 1996 15:29:23 -0400 (EDT)
Received: from spudboy.telebase.com (spudboy.telebase.com [172.16.2.215])
    by wormhole.telebase.com (8.7.4/8.6.9.1) with ESMTP id PAA12821 for ;
    Wed, 21 Aug 1996 15:29:22 -0400 (EDT)
Received: (from chuck@localhost) by spudboy.telebase.com (8.7.5/8.6.9.1)
    id PAA09421 for new-httpd@hyperreal.com;
    Wed, 21 Aug 1996 15:29:18 -0400 (EDT)
From: Chuck Murcko
Message-Id: <199608211929.PAA09421@telebase.com.>
Subject: Re: irix 5.3 and 1.1.1
To: new-httpd@hyperreal.com
Date: Wed, 21 Aug 1996 15:29:17 -0400 (EDT)
In-Reply-To: <4vafvd$5n6@re.hotwired.com> from "Dean Gaudet" at Aug 19, 96 07:40:29 pm
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-new-httpd@apache.org
Precedence: bulk
Reply-To: new-httpd@hyperreal.com

Are we getting an unexpected EINVAL or something like it that's messing
up the mutex operation? We just found something like that here with some
Solaris software we'd written.

One possible scenario that could cause this is a bad pointer stepping on
lock_it or unlock_it in the fcntl() calls.

Dean Gaudet liltingly intones:
>
> I think I'm running into the children-not-dying problem on irix 5.3
> under 1.1.1.  I applied a patch from Ben that I thought dealt with this,
> but it doesn't seem to be working.  Have I missed another related patch?
> I'll include (part of) Ben's patch below for reference (revision numbers
> are mine, not hyperreal's).
>
> The symptom is: the machine's load shoots up to 27+ at which point a
> monitoring script I have running cuts in and kills the webserver and
> restarts it.
>
> Dean
>
> Index: http_main.c
> ===================================================================
> RCS file: /hot/repository/apache/src/http_main.c,v
> retrieving revision 1.16
> retrieving revision 1.17
> diff -c -r1.16 -r1.17
> *** http_main.c 1996/08/02 07:19:49     1.16
> --- http_main.c 1996/08/02 07:23:10     1.17
> ***************
> *** 845,853 ****
>   #endif
>   }
>
>   int wait_or_timeout (int *status)
>   {
> !     wait_or_timeout_retval = -1;
>
>   #if defined(NEXT)
>       if (setjmp(wait_timeout_buf) != 0) {
> --- 845,874 ----
>   #endif
>   }
>
> + #ifdef BROKEN_WAIT
> + /*
> + Some systems appear to fail to deliver dead children to wait() at times.
> + This sorts them out.
> + */
> + void reap_children()
> + {
> +     int status,n;
> +
> +     for(n=0 ; n < HARD_SERVER_LIMIT ; ++n)
> +         if(scoreboard_image->servers[n].status != SERVER_DEAD
> +            && waitpid(scoreboard_image->servers[n].pid,&status,WNOHANG) == -1
> +            && errno == ECHILD)
> +         {
> +             sync_scoreboard_image();
> +             update_child_status(n,SERVER_DEAD,NULL);
> +         }
> + }
> + #endif
> +
>   int wait_or_timeout (int *status)
>   {
> !     int wait_or_timeout_retval = -1;
> !     static int ntimes;
>
>   #if defined(NEXT)
>       if (setjmp(wait_timeout_buf) != 0) {
> ***************
> *** 857,863 ****
>           errno = ETIMEDOUT;
>           return wait_or_timeout_retval;
>       }
> !
>       signal (SIGALRM, longjmp_out_of_alarm);
>       alarm(1);
>   #if defined(NEXT)
> --- 878,890 ----
>           errno = ETIMEDOUT;
>           return wait_or_timeout_retval;
>       }
> ! #ifdef BROKEN_WAIT
> !     if(++ntimes == 60)
> !     {
> !         reap_children();
> !         ntimes=0;
> !     }
> ! #endif
>       signal (SIGALRM, longjmp_out_of_alarm);
>       alarm(1);
>   #if defined(NEXT)
>

chuck
Chuck Murcko    N2K Inc.    Wayne PA    chuck@telebase.com
And now, on a lighter note:
Our OS who art in CPU, UNIX be thy name.
    Thy programs run, thy syscalls done,
    In kernel as it is in user!
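
On the EINVAL question at the top of this message: one way to find out
would be to check the return value of each fcntl() lock call and log the
errno. What follows is only a rough sketch of that idea, not code from
http_main.c. The helper name checked_lock_op() is hypothetical, and it
builds a fresh struct flock on every call on the assumption that this
would rule out the "bad pointer stepping on lock_it or unlock_it"
scenario while debugging.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of Apache): apply one whole-file lock
 * operation with fcntl() and report the errno on failure instead of
 * silently ignoring it.
 *
 * type is F_WRLCK to take the lock or F_UNLCK to release it.
 * Returns 0 on success, -1 on failure with errno preserved.
 */
int checked_lock_op(int fd, short type)
{
    struct flock op;
    int rc;

    /* Build the request from scratch each time, so a stray write to a
     * long-lived static structure cannot corrupt it between calls. */
    memset(&op, 0, sizeof(op));
    op.l_type = type;
    op.l_whence = SEEK_SET;
    op.l_start = 0;
    op.l_len = 0;               /* 0 means "through end of file" */

    /* F_SETLKW blocks until the lock is granted; retry if a signal
     * (for example SIGALRM) interrupts the wait. */
    do {
        rc = fcntl(fd, F_SETLKW, &op);
    } while (rc == -1 && errno == EINTR);

    if (rc == -1) {
        int saved = errno;      /* fprintf may change errno */
        fprintf(stderr, "fcntl lock op %d on fd %d failed: %s\n",
                (int)type, fd, strerror(saved));
        errno = saved;
    }
    return rc;
}
```

A child would call checked_lock_op(fd, F_WRLCK) before accept() and
checked_lock_op(fd, F_UNLCK) after it. If EINVAL shows up in the log,
that would point at the lock request itself rather than at wait().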