www-apache-bugdb mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Nathan J Kurz <n...@tripod.tripod.com>
Subject Re: general/792: race condition with SIGUSR1 graceful restart
Date Thu, 26 Jun 1997 22:00:02 GMT
The following reply was made to PR general/792; it has been noted by GNATS.

From: Nathan J Kurz <nate@tripod.tripod.com>
To: dgaudet@arctic.org (Dean Gaudet)
Subject: Re: general/792: race condition with SIGUSR1 graceful restart
Date: Thu, 26 Jun 1997 17:50:32 -0400 (EDT)

 
 > So the only place where there's a problem is between the signal(deferred) 
 > call and the accept().  
 
 Yes, that it the small place I am talking about.  If you put a pause()
 in right before the accept I think SIGUSR1 will be ignored by the
 child and it will run until the generation check catches it in the top
 of the child_main() loop.  I think that multiple KeepAlive requests
 would still be run, though.
 
 > But if we do what you suggest then we run into a
 > problem *after* the accept call.  We may have accepted a connection, and
 > then get hit with a signal before we can disable the longjmp.  That is
 > something I deliberately tried to avoid.  We also can't trust the values
 > of any of the local variables, so it'd be hard to know if we accepted a
 > connection or if we're just supposed to die or what.
 
 That is the problem, and is why I didn't submit a real patch.  I wasn't
 able to test it well now, so I didn't try.  I was actually doing this
 in separate code, and only saw the problem in Apache after I caught it
 in my own.
 
 > Your longjmp thing is clever though.  But I can't think of how to test if
 > we really got a connection after the loop exits due to deferred_die being
 > set.  For example, we could get the signal after accept() returns but
 > before csd is set.  race conditions rule. 
 
 I thought I came up with a better solution using the long jump, but
 now that you point it out it will fail if the signal comes between the
 return of the accept() and the assignment to the socket variable.  The
 way my while loop is done avoids problems if the signal comes later
 than that though. (I think I avoid the clobbering of local variables
 by making it into function that was passed a connection record, but
 perhaps I'm wrong.)  In case it's useful, here's the code I currently
 have for my other server:
 
 int connection_accept(int server_socket, connection_rec *connection)
 {
   int address_length;
 
   /* be certain that the socket is initially invalid */
   connection->socket = -1;
 
   /* make a SIGHUP set global_server_stop and then jump to here */
   if (sigsetjmp(global_jump_buffer, 1) == 0) {
     signal_set(SIGHUP, signal_stop_server_and_jump);
   }
 
   /* NOTE: a long jump is necessary to avoid the possibility of a signal
      coming after the loop is started and before the accept.  Without
      the jump we might wait an awful long time in accept (until another
      signal arrives) */
 
   while (! global_stop_server) {
 
     /* a SIGHUP here sets global_stop_server and jumps to above */
 
     address_length = sizeof(connection->address);
     connection->socket = accept(server_socket, 
                                 (struct sockaddr *)&(connection->address), 
                                 &address_length);
 
     /* don't loop if we sucessfully accepted */
     if (connection->socket >= 0) break;
 
     /* try again if we were interrupted */
     if (errno == EINTR) {
       continue;
     }
 
     /* return in ERROR if something bad happened */
     signal_set(SIGHUP, signal_stop_server);
     log_error("couldn't accept a new request");
     log_unix_error("accept");
     return ERROR;
   }
 
   /* reset the signal handler to simply set global_server_stop */
   signal_set(SIGHUP, signal_stop_server);
 
   /* socket will be -1 if a signal sets global_stop_server before the accept */
   if (connection->socket < 0) {
     return ERROR;
   }
 
   /* NOTE: global_stop_server was set if there was a SIGHUP after the accept */
   return SUCCESS;
 }
 
 > I've been able to run a "while 1; kill -USR1" loop against the server and
 > surf without a broken link with the current code.  But before I had the
 > deferred stuff in there I did have the slight race condition I talk about
 > above -- where an accept may succeed and then get signalled to death.  And
 > when that was in there I did get broken links while surfing. 
 
 That agrees with what I see, since the current code is conservative.
 The only problem is that the child occasionally is allowed to live one
 connection beyond when it should.  But it is a minor problem in that
 it never kills things after the connection is accepted.
 
 > On architectures with serialization in that loop the current code lets one
 > child live for at most one more request.  On other architectures, where
 > everyone gets plopped into accept() and the OS gets to wake 'em up, it's
 > possible for some children to be stuck "gracefully exiting" if the OS
 > starves them at the accept(). 
 
 You're right -- if they are serialized there should be only one
 process that is affected by this.  But it is one more connection, not
 one more request.  Perhaps deferred_die could be checked as part of the
 read_request() loop?
 
 Good luck!
 
 --nate
 
 ps. I've got a Linux ThinkPad 560 program that might be worth putting
 on your page if you are still maintaining it.  I'll send that
 separately.
 
 

Mime
View raw message