httpd-bugs mailing list archives

From Jorge Román Novalbos <jro...@linux-it.es>
Subject Re: httpd hangs ?
Date Thu, 22 Mar 2012 13:08:52 GMT
If you want, you can set the Apache directive "CoreDumpDirectory /tmp/apache2-gdb-dump" and
analyze the cause of the segmentation fault later with gdb.
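
To see which child crashed and when, it may also help to pull the segfault notices out of the error log before opening any core file. A rough sketch in Python, assuming the usual Apache 2.2 error log layout ("[timestamp] [notice] child pid N exit signal Segmentation fault (11)", like the notice quoted below); the name find_segfaults and the sample lines are made up for illustration and are not taken from your server:

```python
import re

# Assumed Apache 2.2 error-log layout, e.g.:
# [Thu Mar 22 12:58:01 2012] [notice] child pid 3818 exit signal Segmentation fault (11)
# The pattern is anchored at the start only, so any trailing text is ignored.
SEGFAULT_RE = re.compile(
    r"^\[(?P<when>[^\]]+)\] \[notice\] child pid (?P<pid>\d+) "
    r"exit signal Segmentation fault \(11\)"
)


def find_segfaults(lines):
    """Return (timestamp, pid) pairs for each child that exited with signal 11."""
    hits = []
    for line in lines:
        match = SEGFAULT_RE.match(line)
        if match:
            hits.append((match.group("when"), int(match.group("pid"))))
    return hits


if __name__ == "__main__":
    # Hypothetical sample lines; in practice, iterate over the real ErrorLog file.
    sample = [
        "[Thu Mar 22 12:58:01 2012] [notice] child pid 3818 exit signal Segmentation fault (11)",
        "[Thu Mar 22 12:58:02 2012] [info] some unrelated message",
    ]
    for when, pid in find_segfaults(sample):
        print(when, pid)
```

The timestamps can then be compared with the moments when the server stops answering, to check whether a segfault really precedes every hang.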

Jorge


On 22/03/2012, at 14:03, Pawel wrote:

> It seems that directly before every hang, one of the child processes is killed:
> [notice] child pid 3818 exit signal Segmentation fault (11)
>
> It seems that some kind of deadlock occurs and the main process is waiting for something.
>
> The Apache worker configuration probably does not matter.
> The number of requests per second only changes the probability of the signal.
>
> I used another machine with 4 CPUs and 4 GB of RAM, with the same system (disk image copy), and there are no segmentation faults and no hangs. Because of its capacity, that system is handling only part of the production traffic...
> 
> Pawel
> 
> 
> 
> 
> 
> 
> 
> On 2012-03-14 10:35, Jorge Román Novalbos wrote:
>> 
>> I would try several things:
>> 
>> 1- Get a system profile every minute for the whole day, if possible. If you have monitoring tools like Nagios or Cacti, you can use them.
>> 2- Isolate the server and try to stress it in order to reproduce the problem and look at the system profile. You could use ab or JMeter for this.
>> 3- If you don't see the problem, I would try disabling logging to disk. I don't know if you can afford that, but logging to disk generally reduces Apache performance.
>> 
>>  Question:
>> 
>> Apache hangs whenever it reaches 200 requests per second, but does this happen every time Apache reaches that threshold? We have to try to find some kind of behavior pattern for this.
>> 
>> Jorge.
>> 
>> 
>> 
>> On 14/03/2012, at 09:52, Pawel wrote:
>> 
>>> On 2012-03-14 08:31, Jorge Román Novalbos wrote:
>>>>
>>>> OK. When Apache hangs, what is the load average? Is it high?
>>> =1
>>> When Apache is not running or is dead, there is no other activity on the server (apart from system tasks such as cron jobs, syslog, etc.)
>>>> 
>>>> 
>>>> Could you capture a picture of the system at the moment of the problem? I mean load, top, free, iostat -x 1, an Apache server-status, and ps -aux, in order to find out where the bottleneck is.
>>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>> sda               0.00     0.00    1.00    4.00     8.00    16.00     9.60     0.00    0.60    3.00    0.00   0.60   0.30
>>> etherd!e2.3       0.00     0.00  156.00   40.00    24.00   160.00     8.00     0.00    0.24    0.27    0.12   0.24   0.70
>>> 
>>> The ps -auxf output was attached to the first mail.
>>> The system is not swapping.
>>> 
>>>> 
>>>> Questions: 
>>>> What kind of requests are you serving? Only static objects, or also dynamic ones? PHP, Java, Ruby?
>>> It seems that the httpd process is not a worker. It listens on the socket; there is no established or waiting connection to the process. It seems that it does not serve any content.
>>> But in general Apache uses the php and ajp12 modules.
>>> Dynamic content: 93%
>>> 
>>>> 
>>>> Do you know the Apache thread size?
>>> stack > 8196 but < 16 000 (because of ulimit)
>>> With Apache 2.2.19, 8196 was enough.
>>>
>>> Information about a "normal" (alive) Apache process:
>>>  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>> 16938 www       20   0  129m  11m 3064 S    2  0.1   0:00.07 httpd
>>> 
>>> LDD
>>>        linux-vdso.so.1 =>  (0x000073af9d1e7000)
>>>         /lib64/libsafe.so.2 (0x000073af9cdc2000)
>>>         libz.so.1 => /lib64/libz.so.1 (0x000073af9cba9000)
>>>         libssl.so.1.0.0 => /usr/lib64/libssl.so.1.0.0 (0x000073af9c945000)
>>>         libcrypto.so.1.0.0 => /usr/lib64/libcrypto.so.1.0.0 (0x000073af9c560000)
>>>         libm.so.6 => /lib64/libm.so.6 (0x000073af9c2dd000)
>>>         libexpat.so.1 => /usr/lib64/libexpat.so.1 (0x000073af9c0af000)
>>>         libuuid.so.1 => /lib64/libuuid.so.1 (0x000073af9beaa000)
>>>         librt.so.1 => /lib64/librt.so.1 (0x000073af9bca1000)
>>>         libcrypt.so.1 => /lib64/libcrypt.so.1 (0x000073af9ba6a000)
>>>         libpthread.so.0 => /lib64/libpthread.so.0 (0x000073af9b84d000)
>>>         libdl.so.2 => /lib64/libdl.so.2 (0x000073af9b649000)
>>>         libc.so.6 => /lib64/libc.so.6 (0x000073af9b2bd000)
>>>         /lib64/ld-linux-x86-64.so.2 (0x000073af9cfc9000)
>>> 
>>>> 
>>>> Hardware features: RAM, CPU? Is it a virtual or a physical server?
>>> Physical, heavily loaded, 16 CPUs, 16 GB RAM.
>>> There are > 200 vhosts; Apache keeps ~600 log files open. At least 2 files are shared user logs for ~50 vhosts (PHP), (~200 MB of logs per day per file).
>>> 
>>>> 
>>>> Could you send me the MPM configuration? worker, prefork, MaxClients, etc.
>>> 
>>> MaxKeepAliveRequests 50
>>> KeepAliveTimeout 2
>>> ListenBacklog 1
>>> 
>>> StartServers         80
>>> MinSpareServers      80
>>> MaxSpareServers      250
>>> MaxClients           250
>>> MaxRequestsPerChild  20
>>> 
>>> For tests I changed:
>>> MinSpareServers    =  MaxSpareServers      = MaxClients      = 250
>>> It has been working for ~23 hours. Let's see...
>>> 
>>> 
>>> This could be helpful:
>>> When the Apache process hangs, no new child processes appear and some children are zombies. Stopping Apache is possible by killing all Apache processes with signal 9. (They die after ~2 minutes.)
>>> 
>>> 
>>>  
>>> Regards & Thanks 
>>> Pawel 
>>> 
>>> 
>>>> 
>>>> Thanks!
>>>> 
>>>> 
>>>> On 13/03/2012, at 16:53, Pawel wrote:
>>>> 
>>>>> Hi,
>>>>> No, I do not use NFS.
>>>>> It seems that Apache is not waiting for any filesystem.
>>>>> My "D" Apache process keeps only log files open, and they are located on a local filesystem. There is no disk I/O traffic.
>>>>> 
>>>>> Pawel
>>>>> 
>>>>> 
>>>>> On 2012-03-13 15:09, Jorge Román Novalbos wrote:
>>>>>> 
>>>>>> Hi Pawel, 
>>>>>> 
>>>>>> I have had the same problem when network problems kept us from reaching our NFS volume. Is NFS involved in any Apache process?
>>>>>>
>>>>>> I mean, are the httpd binaries, the logs or the DocumentRoot on NFS?
>>>>>> 
>>>>>> Jorge.
>>>>>> 
>>>>>> On 13/03/2012, at 15:03, Pawel wrote:
>>>>>> 
>>>>>>> Hello,
>>>>>>> 
>>>>>>> After upgrading to 2.2.22 (from 2.2.19), my Apache stops responding to network queries.
>>>>>>> It happens on a quite busy system (~200 workers), about once per day.
>>>>>>> One Apache process is in "D" state.
>>>>>>>
>>>>>>> Apache is running on a 3.2.2 kernel, gcc 4.5.3-r1 p1.0, pie-0.4.5
>>>>>>>
>>>>>>>
>>>>>>> Is it a known bug?
>>>>>>> Has anyone seen this behavior?
>>>>>>> 
>>>>>>> Thanks
>>>>>>> Pawel 
>>>>>>> 
>>>>>>> 
>>>>>>> www      19504  0.0  0.0 141084 19828 ?        Ds   Mar09   0:44 /usr/local/apache22/bin/httpd -k start
>>>>>>> www      12773  0.3  0.0      0     0 ?        Z    09:40   0:01  \_ [httpd] <defunct>
>>>>>>> www      12815  0.4  0.0      0     0 ?        Z    09:41   0:01  \_ [httpd] <defunct>
>>>>>>> www      12844  0.2  0.0      0     0 ?        Z    09:42   0:00  \_ [httpd] <defunct>
>>>>>>> www      12876  0.2  0.0      0     0 ?        Z    09:42   0:01  \_ [httpd] <defunct>
>>>>>>> www      12896  0.2  0.0      0     0 ?        Z    09:43   0:00  \_ [httpd] <defunct>
>>>>>>> www      12918  0.2  0.0      0     0 ?        Z    09:44   0:00  \_ [httpd] <defunct>
>>>>>>> www      12946  0.2  0.0      0     0 ?        Z    09:44   0:00  \_ [httpd] <defunct>
>>>>>>> www      12968  0.2  0.0      0     0 ?        Z    09:45   0:00  \_ [httpd] <defunct>
>>>>>>> www      13001  0.5  0.0      0     0 ?        Z    09:45   0:01  \_ [httpd] <defunct>
>>>>>>> www      13020  0.5  0.0      0     0 ?        Z    09:46   0:01  \_ [httpd] <defunct>
>>>>>>> www      13036  2.2  0.0      0     0 ?        Z    09:46   0:04  \_ [httpd] <defunct>
>>>>>>> www      13057  0.5  0.0      0     0 ?        Z    09:47   0:00  \_ [httpd] <defunct>
>>>>>>> www      13077  2.7  0.0      0     0 ?        Z    09:47   0:03  \_ [httpd] <defunct>
>>>>>>> www      13105  1.3  0.0      0     0 ?        Z    09:48   0:01  \_ [httpd] <defunct>
>>>>>>> www      13135  1.1  0.0 159492 24208 ?        SL   09:48   0:00  \_ /usr/local/apache22/bin/httpd -k start
>>>>>>> www      13171  0.5  0.0 156436 21408 ?        SL   09:49   0:00  \_ /usr/local/apache22/bin/httpd -k start
>>>>>>> www      13210  0.0  0.0 141084  9628 ?        R    09:49   0:00  \_ /usr/local/apache22/bin/httpd -k start
>>>>>>> 
>>>>>>> strace (one message per ~ 30 seconds )
>>>>>>> clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x73be642dda10) = 13701
>>>>>>> clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x73be642dda10) = 13720
>>>>>>> clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x73be642dda10) = 13740
>>>>>>> wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG|WSTOPPED, NULL) = 12773
>>>>>>> wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG|WSTOPPED, NULL) = 12815
>>>>>>> wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG|WSTOPPED, NULL) = 12844
>>>>>>> wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG|WSTOPPED, NULL) = 12876
>>>>>>> wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG|WSTOPPED, NULL) = 12896
>>>>>>> wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG|WSTOPPED, NULL) = 12918
>>>>>>> wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG|WSTOPPED, NULL) = 12946
>>>>>>> wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG|WSTOPPED, NULL) = 12968
>>>>>>> wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG|WSTOPPED, NULL) = 13001
>>>>>>> wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG|WSTOPPED, NULL) = 13020
>>>>>>> wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG|WSTOPPED, NULL) = 13036
>>>>>>> wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG|WSTOPPED, NULL) = 13057
>>>>>>> wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG|WSTOPPED, NULL) = 13077
>>>>>>> wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG|WSTOPPED, NULL) = 13105
>>>>>>> wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG|WSTOPPED, NULL) = 13135
>>>>>>> wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG|WSTOPPED, NULL) = 13171
>>>>>>> wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG|WSTOPPED, NULL) = 13210
>>>>>>> wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG|WSTOPPED, NULL) = 13245
>>>>>>> wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG|WSTOPPED, NULL) = 13267
>>>>>>> wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG|WSTOPPED, NULL) = 13291
>>>>>>> wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG|WSTOPPED, NULL) = 13310
>>>>>>> wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG|WSTOPPED, NULL) = 13331
>>>>>>> wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG|WSTOPPED, NULL) = 13351
>>>>>>> wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG|WSTOPPED, NULL) = 13372
>>>>>>> wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG|WSTOPPED, NULL) = 13409
>>>>>>> wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG|WSTOPPED, NULL) = 13431
>>>>>>> wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG|WSTOPPED, NULL) = 13449
>>>>>>> wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG|WSTOPPED, NULL) = 13475
>>>>>>> wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG|WSTOPPED, NULL) = 13502
>>>>>>> wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG|WSTOPPED, NULL) = 13564
>>>>>>> wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG|WSTOPPED, NULL) = 13588
>>>>>>> wait4(-1, 0x7ad781767684, WNOHANG|WSTOPPED, NULL) = 0
>>>>>>> select(0, NULL, NULL, NULL, {1, 0})     = 0 (Timeout)
>>>>>>> write(2, "[Tue Mar 13 10:00:01 2012] [info"..., 179) = 179
>>>>>>> clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x73be642dda10) = 13763
>>>>>>> --- {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=13606, si_status=0, si_utime=55, si_stime=24} (Child exited) ---
>>>>>>> clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x73be642dda10) = 13782
>>>>>>> --- {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=13639, si_status=0, si_utime=38, si_stime=14} (Child exited) ---
>>>>>>> clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x73be642dda10) = 13810
>>>>>>> --- {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=13740, si_status=0, si_utime=89, si_stime=34} (Child exited) ---
>>>>>>> clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x73be642dda10) = 13846
>>>>>>> clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x73be642dda10) = 13870
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 

