Date: Mon, 12 Apr 1999 12:58:46 -0700 (PDT)
From: <unknown@riverstyx.net>
To: new-httpd@apache.org
Subject: Re: apache-apr
In-Reply-To: <19990412025949.A14370@io.com>
Message-ID: <Pine.LNX.4.04.9904121243320.15404-100000@hades.riverstyx.net>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: new-httpd-owner@apache.org
Precedence: bulk
Reply-To: new-httpd@apache.org

I dropped the MaxRequestsPerChild down to 5 so the server'd die even
quicker.  I went and found the process that was in the state you
described, and here's what I got:

[root@hades apache-apr]# ps fax|grep httpd|awk '{print $1}'|perl -ne 'chop; system "lsof -p $_";'|grep wW
httpd-apr 15359 root   53wW  REG    3,5        0  122969 /usr/local/apache-apr/logs/accept.lock.1 (deleted)
[root@hades apache-apr]# gdb httpd-apr 15359
GNU gdb 4.17.0.4 with Linux/x86 hardware watchpoint and FPU support
Copyright 1998 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-redhat-linux"...

/usr/local/apache-apr/15359: No such file or directory.
Attaching to program `/usr/local/apache-apr/httpd-apr', Pid 15359
Reading symbols from /lib/libm.so.6...done.
Reading symbols from /lib/libcrypt.so.1...done.
Reading symbols from /lib/libdb.so.2...done.
Reading symbols from /lib/libpthread.so.0...done.
Reading symbols from /lib/libc.so.6...done.
Reading symbols from /lib/ld-linux.so.2...done.
Reading symbols from /lib/libnss_files.so.1...done.
Reading symbols from /lib/libnss_nis.so.1...done.
Reading symbols from /lib/libnsl.so.1...done.
Reading symbols from /lib/libnss_dns.so.1...done.
Reading symbols from /lib/libresolv.so.2...done.
0x400e55c2 in __libc_accept ()
(gdb) where
#0  0x400e55c2 in __libc_accept ()
#1  0x4006c0c1 in accept (fd=16, addr={__sockaddr__ = 0xbd5ffd48, 
      __sockaddr_at__ = 0xbd5ffd48, __sockaddr_ax25__ = 0xbd5ffd48, 
      __sockaddr_dl__ = 0xbd5ffd48, __sockaddr_eon__ = 0xbd5ffd48, 
      __sockaddr_in__ = 0xbd5ffd48, __sockaddr_in6__ = 0xbd5ffd48, 
      __sockaddr_inarp__ = 0xbd5ffd48, __sockaddr_ipx__ = 0xbd5ffd48, 
      __sockaddr_iso__ = 0xbd5ffd48, __sockaddr_ns__ = 0xbd5ffd48, 
      __sockaddr_un__ = 0xbd5ffd48, __sockaddr_x25__ = 0xbd5ffd48}, 
    addr_len=0xbd5ffd44) at wrapsyscall.c:146
#2  0x80768ed in accept_thread ()
#3  0x40069357 in pthread_start_thread (arg=0xbd5ffea4) at manager.c:192
(gdb) quit

After that, I checked with lsof again and found two threads in that
condition.  A second later, the older of the two threads terminated, but I
had time to check the remaining thread... it was in the exact same
condition.  I kept checking, got the same results...
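
For anyone who wants to repeat the lsof check without the shell pipeline,
here is a rough Python sketch of the same filter.  The function names
(write_locked_rows, scan_pids) are made up for illustration, and it assumes
lsof prints its usual nine columns (COMMAND PID USER FD TYPE DEVICE SIZE/OFF
NODE NAME) and is on the PATH.  Per the lsof man page, a trailing "W" in the
FD column marks a write lock on the entire file.

```python
import re
import subprocess

# FD column: digits, then an access mode (r, w, or u), then a lock character.
# "W" is lsof's marker for a write lock on the entire file.
FD_WRITE_LOCKED = re.compile(r"^\d+[rwu]W$")


def write_locked_rows(lsof_text):
    """Return (command, pid, fd, name) for rows whose FD column ends in 'W'."""
    rows = []
    for line in lsof_text.splitlines():
        fields = line.split()
        # Skip the header row and any row that lacks the nine expected columns.
        if len(fields) < 9 or not fields[1].isdigit():
            continue
        if FD_WRITE_LOCKED.match(fields[3]):
            # NAME may contain spaces, e.g. a trailing "(deleted)".
            name = " ".join(fields[8:])
            rows.append((fields[0], int(fields[1]), fields[3], name))
    return rows


def scan_pids(pids):
    """Run 'lsof -p PID' for each PID and collect the write-locked rows."""
    found = []
    for pid in pids:
        result = subprocess.run(["lsof", "-p", str(pid)],
                                capture_output=True, text=True)
        found.extend(write_locked_rows(result.stdout))
    return found
```

Rows where lsof leaves a column blank will be skipped by the column-count
check, so treat this as a convenience filter, not a complete parser.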

Here's a backtrace from another thread in the same process.  I'm not sure I
caught it while the process was still stuck, but I'm pretty sure I did,
because I got this same result from threads in four other processes, each of
which I had just verified as having a stuck thread.

(gdb) where
#0  0x400e5164 in __syscall_sigsuspend ()
#1  0x401096cc in __DTOR_END__ ()
#2  0x40068513 in pthread_cond_wait (cond=0x80a0c44, mutex=0x80a0c2c)
    at restart.h:49
#3  0x808724f in queue_pop ()
#4  0x8076a2e in worker_thread ()
#5  0x40069357 in pthread_start_thread (arg=0xbf7ffea4) at manager.c:192
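
Collecting these backtraces by hand gets tedious, so here is a hedged sketch
of how it might be scripted.  It assumes a gdb recent enough to support the
-batch and -ex options, which the 4.17 build used above may not be, and the
helper names (gdb_backtrace_cmd, backtraces) are hypothetical.  Since each
thread shows up here under its own PID, the sketch simply attaches to one
PID at a time.

```python
import subprocess


def gdb_backtrace_cmd(executable, pid):
    """Build a non-interactive gdb command that attaches to pid and runs 'where'."""
    # -batch makes gdb exit after running its commands; -ex runs one command.
    # Both are assumed to be supported by the installed gdb.
    return ["gdb", "-batch", "-ex", "where", executable, str(pid)]


def backtraces(executable, pids):
    """Return {pid: gdb output} for each PID.  Attaching usually requires root."""
    results = {}
    for pid in pids:
        proc = subprocess.run(gdb_backtrace_cmd(executable, pid),
                              capture_output=True, text=True)
        results[pid] = proc.stdout
    return results
```

Something like backtraces("/usr/local/apache-apr/httpd-apr", [15359]) would
then stand in for the manual attach/where/quit sequence.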

---
tani hosokawa
river styx internet


On Mon, 12 Apr 1999, Manoj Kasichainula wrote:

> I'm not surprised that you're seeing a process in this fcntl lock
> condition. As Ryan noted, this is a known problem which needs to be
> fixed. But, I don't think it should show up if every port the server
> listens to is hit by a client on a regular basis. Is this the case?
> 
> I would be surprised if all of the processes were in this fcntl
> locking state. The reason that this process is still waiting on the
> fcntl lock is that another process has the fcntl lock. And that
> process with the lock is supposed to be accepting. Are you actually
> seeing every thread in one of the states you described? If not, don't
> check yet, that's too much work.
> 
> If you have lsof, do the following (which can probably be automated):
> 
> 1. Get lsof.
> 2. run lsof on the main thread of each process, and look under the
> "FD" column for a capital W after a lower-case w, like this:
> 
> COMMAND PID  USER   FD   TYPE     DEVICE SIZE/OFF  NODE NAME
> (...)
> httpd   568 manoj   17wW  REG        3,6        0 16411 /home (/dev/hda6)
>                        ^
> The important part_____|
> 
> Just running lsof on every httpd thread and grepping for a "W" or "wW"
> should make this easy.
> 
> 3. pull up a gdb backtrace of this thread, assuming there is one. It
> is supposed to be accepting requests, but it is probably not. What is
> it doing instead?
> 
> 4. If you can get a backtrace of the other threads of that process
> (pstree -p will tell you which other threads are in that process),
> that would be great, too.
> 
> Thanks for your help.
>