httpd-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Paul Richards <p.richa...@elsevier.co.uk>
Subject SIGUSR1
Date Thu, 21 Nov 1996 15:16:53 GMT

I looked into this last night and this morning and my conclusion is
that it's totally and utterly broken.

With the current sources I see *exactly* the same behaviour from both
FreeBSD and Solaris. We should simply disable it for this release.
If you accidentally SIGUSR1 a child process you get a fork storm.

I can't make any sense of the code, it looks to me like when a gracful
restart occurs the parent will spawn off another batch of children.

A subsequent -HUP fails I think because the parent is freeing the
socket but this only seems to happen after a USR1 and I haven't even
started looking into why this happens so I may off track on that bit.

This is what happens:

Script started on Thu Nov 21 14:44:52 1996
Enable tcsh features
dpr@tees:~/httpd% ./httpd -d .
Non-graceful restart, 2855
Made new socket, 2855
dpr@tees:~/httpd% ps -aefj | grep httpd
     dpr  2856  2855  2855  2855  0 14:45:00 ?        0:00 ./httpd -d .
     dpr  2857  2855  2855  2855  0 14:45:00 ?        0:00 ./httpd -d .
     dpr  2859  2855  2855  2855  0 14:45:00 ?        0:00 ./httpd -d .
     dpr  2860  2855  2855  2855  0 14:45:00 ?        0:00 ./httpd -d .
     dpr  2855     1  2855  2855  0 14:45:00 ?        0:00 ./httpd -d .
     dpr  2858  2855  2855  2855  0 14:45:00 ?        0:00 ./httpd -d .
     dpr  2862  2845  2861  2845  1 14:45:11 pts/6    0:00 grep httpd
dpr@tees:~/httpd% kill -USR1 `cat logs/httpd.pid `
dpr@tees:~/httpd% ps -aefj | grep httpd
     dpr  2856  2855  2855  2855  0 14:45:00 ?        0:00 ./httpd -d .
     dpr  2870  2845  2869  2845  1 14:45:29 pts/6    0:00 grep httpd
     dpr  2866  2855  2855  2855  0 14:45:27 ?        0:00 ./httpd -d .
     dpr  2857  2855  2855  2855  0 14:45:00 ?        0:00 ./httpd -d .
     dpr  2859  2855  2855  2855  0 14:45:00 ?        0:00 ./httpd -d .
     dpr  2860  2855  2855  2855  0 14:45:00 ?        0:00 ./httpd -d .
     dpr  2855     1  2855  2855  1 14:45:00 ?        0:00 ./httpd -d .
     dpr  2858  2855  2855  2855  0 14:45:00 ?        0:00 ./httpd -d .
     dpr  2868  2855  2855  2855  0 14:45:27 ?        0:00 ./httpd -d .
     dpr  2865  2855  2855  2855  0 14:45:27 ?        0:00 ./httpd -d .
     dpr  2864  2855  2855  2855  0 14:45:27 ?        0:00 ./httpd -d .
     dpr  2867  2855  2855  2855  0 14:45:27 ?        0:00 ./httpd -d .

[ I requested a page from the server at this point which caused child
  process 2870 to be reaped  and 2872 to be created ]

dpr@tees:~/httpd% ps -aefj | grep httpd
     dpr  2856  2855  2855  2855  0 14:45:00 ?        0:00 ./httpd -d .
     dpr  2866  2855  2855  2855  0 14:45:27 ?        0:00 ./httpd -d .
     dpr  2857  2855  2855  2855  0 14:45:00 ?        0:00 ./httpd -d .
     dpr  2859  2855  2855  2855  0 14:45:00 ?        0:00 ./httpd -d .
     dpr  2860  2855  2855  2855  0 14:45:00 ?        0:00 ./httpd -d .
     dpr  2855     1  2855  2855  0 14:45:00 ?        0:00 ./httpd -d .
     dpr  2858  2855  2855  2855  0 14:45:00 ?        0:00 ./httpd -d .
     dpr  2868  2855  2855  2855  0 14:45:27 ?        0:00 ./httpd -d .
     dpr  2865  2855  2855  2855  0 14:45:27 ?        0:00 ./httpd -d .
     dpr  2864  2855  2855  2855  1 14:45:27 ?        0:00 ./httpd -d .
     dpr  2867  2855  2855  2855  0 14:45:27 ?        0:00 ./httpd -d .
     dpr  2872  2845  2871  2845  1 14:45:47 pts/6    0:00 grep httpd
dpr@tees:~/httpd% kill -USR1 `cat logs/httpd.pid `
dpr@tees:~/httpd%
I'm dead, 2866
I'm dead, 2856
I'm dead, 2865
I'm dead, 2864
I'm dead, 2857
I'm dead, 2858
I'm dead, 2859
I'm dead, 2867
I'm dead, 2868
I'm dead, 2860
Non-graceful restart, 2855
bind: Address already in use
httpd: could not bind to port 8000

script done on Thu Nov 21 14:46:12 1996

Which I guess is because process 2855 is still bound to it since I
don't see anywhere where it closes that socket.

Now, looking at the code, the above is what I'd expect to happen:

(http_main.c, line 1775)

    if(!one_process && !is_graceful)
    {
#ifndef NO_KILLPG
      if (killpg(pgrp,SIGHUP) < 0)    /* Kill 'em off */
#else 
      if (kill(-pgrp,SIGHUP) < 0)
#endif
        log_unixerr ("killpg SIGHUP", NULL, NULL, server_conf);
    }
    
    if(is_graceful)
    {
    log_error("SIGUSR1 received.  Doing graceful restart",server_conf);
    kill_cleanups_for_fd(pconf,sd);
    }
    else if (sd != -1 || listenmaxfd != -1) {
    reclaim_child_processes(); /* Not when just starting up */
    log_error ("SIGHUP received.  Attempting to restart", server_conf);
    }

At this point the config files are read and then

    
    if (listeners == NULL)
    {   
        if(!is_graceful)
        {
printf("Non-graceful restart, %d\n", getpid());
        memset((char *) &sa_server, 0, sizeof(sa_server));
        sa_server.sin_family=AF_INET;
        sa_server.sin_addr=bind_address;
        sa_server.sin_port=htons(server_conf->port);
        
        sd = make_sock(pconf, &sa_server);
printf("Made new socket, %d\n", getpid());
        }
    else
        sd=saved_sd;

followed by stuff if there are listeners and then it spawns off
another batch of children.

I can't see anywhere where children are reaped if the signal is USR1
and the code always falls through to the point where more are created,
the offending bit of code is

   1846         
   1847     num_children = 0;
   1848
   1849     if (daemons_max_free < daemons_min_free + 1) /* Don't thrash... */
   1850     daemons_max_free = daemons_min_free + 1;
   1851     
   1852     while (num_children < daemons_to_start) {
   1853     make_child(server_conf, num_children++);
   1854     }

I wonder if a simple

 if (!is_graceful)
	num_children = 0;

would help??

I suspect that by this point the parent is seriously confused since it
thinks it only has "daemons_to_start" children when it fact it has 
"daemons_to_start * generations" children. I didn't follow this
through to the scoreboard so I've no idea what that looks like in
these circumstances.

Now, I'm not sure exactly what's going on but the line

1823     sd=saved_sd;   

is executed during a SIGUSR1 and I'm betting that this bit of code is
what is screwing up the parent restart on a subsequent SIGHUP since
the error suggests that the parent isn't releasing the socket (it's
the only process still running at that point after a HUP).

All in all, this looks very hacked together and unless someone has a
quick fix we should disable it and take a fresh look at the whole
forking model for 2.0, which we'd have to do anyway for threading.

Incidentally, while I was wandering through the code, why
HARD_SERVER_LIMIT? Shouldn't we know what children we have? If we did
and we were checking internal state more consistently we'd realise
something was going wrong.

931     for (i = 0; i < HARD_SERVER_LIMIT; ++i) {
932     int pid = scoreboard_image->servers[i].pid;
933         
934     if (pid != my_pid && pid != 0)
935         waitpid (scoreboard_image->servers[i].pid, &status, 0);
936     } 

Also, we really need to sort out a style guide, the code I've included
earlier in this message has a really odd style which I'm betting is
Ben's and is a nightmare to follow :-) The variations make things very
hard to follow even when they're just variations of a sane theme and
not due to a cat jumping on the keyboard :-)

-- 
  Paul Richards. Originative Solutions Ltd.  (Netcraft Ltd. contractor)
  Elsevier Science TIS online journal project.
  Email: p.richards@elsevier.co.uk
  Phone: 0370 462071 (Mobile), +44 (0)1865 843155

Mime
View raw message