hawq-dev mailing list archives

From "Lin Wen (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HAWQ-1284) HAWQ master is coredump when kill all process on master and standby
Date Thu, 19 Jan 2017 10:48:26 GMT
Lin Wen created HAWQ-1284:
-----------------------------

             Summary: HAWQ master is coredump when kill all process on master and standby
                 Key: HAWQ-1284
                 URL: https://issues.apache.org/jira/browse/HAWQ-1284
             Project: Apache HAWQ
          Issue Type: Bug
            Reporter: Lin Wen
            Assignee: Ed Espino


When a HAWQ cluster is running with no active queries, killing all postgres processes on the master
(with the command "killall postgres") and then all processes on the standby (with the command
"killall gpsyncmaster") intermittently causes the HAWQ master to produce a core dump.

The callstack is:
#0  0x00000032214325e5 in raise () from /lib64/libc.so.6
#1  0x0000003221433dc5 in abort () from /lib64/libc.so.6
#2  0x00000000008cce7f in errfinish (dummy=Unhandled dwarf expression opcode 0xf3) at elog.c:686
#3  0x00000000008cf032 in elog_finish (elevel=Unhandled dwarf expression opcode 0xf3) at elog.c:1463
#4  0x00000000007d4912 in proc_exit_prepare (code=1) at ipc.c:153
#5  0x00000000007d4a38 in proc_exit (code=1) at ipc.c:93
#6  0x00000000008ccc7e in errfinish (dummy=Unhandled dwarf expression opcode 0xf3) at elog.c:670
#7  0x000000000078dea1 in ServiceDoConnect (listenerPort=64556, complain=Unhandled dwarf expression opcode 0xf3) at service.c:165
#8  0x00000000004efd5a in XLogQDMirrorWrite (WriteRqst=<value optimized out>, flexible=0 '\000', xlog_switch=0 '\000') at xlog.c:1981
#9  XLogWrite (WriteRqst=<value optimized out>, flexible=0 '\000', xlog_switch=0 '\000') at xlog.c:2354
#10 0x00000000004f2242 in XLogFlush (record=...) at xlog.c:2572
#11 0x00000000004f7288 in CreateCheckPoint (shutdown=Unhandled dwarf expression opcode 0xf3) at xlog.c:8136
#12 0x00000000004f9f72 in ShutdownXLOG (code=Unhandled dwarf expression opcode 0xf3) at xlog.c:7865
#13 0x000000000078b2b0 in BackgroundWriterMain () at bgwriter.c:318
#14 0x000000000055a870 in AuxiliaryProcessMain (argc=<value optimized out>, argv=0x7fff02330850) at bootstrap.c:467
#15 0x000000000079b4f0 in StartChildProcess (type=Unhandled dwarf expression opcode 0xf3) at postmaster.c:6836
#16 0x000000000079b7aa in CommenceNormalOperations () at postmaster.c:3618
#17 0x000000000079fee4 in do_reaper () at postmaster.c:3831
#18 ServerLoop () at postmaster.c:2136
#19 0x00000000007a2179 in PostmasterMain (argc=Unhandled dwarf expression opcode 0xf3) at postmaster.c:1454
#20 0x00000000004a4f99 in main (argc=9, argv=0x2a4f010) at main.c:226

The cause is that the "WAL Send Server process" is killed first. When the writer process gets a
shutdown request, it begins to create a checkpoint and sync xlog to the standby master, but by
that point the WAL send server process has already been killed. The writer process therefore fails
to connect to the WAL send server process and reports an ERROR:
				ereport(ERROR, (errcode(ERRCODE_GP_INTERCONNECTION_ERROR),
								errmsg("Could not connect to '%s': %s",
									   serviceConfig->title,
									   strerror(saved_err))));
(service.c, line 165)

From the call stack we can see that when ereport() is called, proc_exit_prepare() is called in
turn. At line 152 (ipc.c), CritSectionCount is greater than 0, so a PANIC occurs and a core dump
is generated. CritSectionCount is incremented when the writer process calls XLogFlush().
	if (CritSectionCount > 0)
		elog(PANIC, "process is dying from critical section");
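
The escalation described above can be modeled in a few lines of standalone C. This is only a
sketch under stated assumptions: CritSectionCount and the START_CRIT_SECTION/END_CRIT_SECTION
macros mimic the backend's names, while ExitKind, model_proc_exit_prepare() and
model_xlog_flush() are hypothetical names invented for illustration and are not HAWQ code.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the backend's critical-section counter. */
static int CritSectionCount = 0;

#define START_CRIT_SECTION() (CritSectionCount++)
#define END_CRIT_SECTION() \
    do { \
        assert(CritSectionCount > 0); \
        CritSectionCount--; \
    } while (0)

/* Hypothetical result type: how the process would leave. */
typedef enum { EXIT_CLEAN, EXIT_PANIC } ExitKind;

/* Models the check quoted above: exiting while inside a critical
 * section is escalated to PANIC (abort() and a core dump in the
 * real server). */
static ExitKind model_proc_exit_prepare(void)
{
    if (CritSectionCount > 0)
        return EXIT_PANIC;
    return EXIT_CLEAN;
}

/* Hypothetical model of the flush path: the connect attempt happens
 * inside the critical section, so a connect failure reaches the exit
 * path with CritSectionCount > 0. The real process would abort here;
 * the model returns instead so the outcome can be inspected. */
static ExitKind model_xlog_flush(bool wal_send_server_alive)
{
    ExitKind result = EXIT_CLEAN;

    START_CRIT_SECTION();
    if (!wal_send_server_alive)
        result = model_proc_exit_prepare();
    END_CRIT_SECTION();

    return result;
}
```

In this model, model_xlog_flush(true) leaves cleanly, while model_xlog_flush(false) reports
EXIT_PANIC, which corresponds to the abort() seen in frames #0 to #4 of the call stack.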
 
A possible solution: before the writer process writes the log to the standby, check whether the
WAL send server process exists. If it does not, don't call WalSendServerClientConnect() to connect
to the WAL send server process.
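
One way such an existence check could look is sketched below, using the POSIX convention that
kill() with signal 0 performs only the existence and permission checks. This is an illustration,
not a patch: process_exists() and should_sync_to_standby() are hypothetical helpers, and the
sketch assumes the writer process can obtain the pid of the WAL send server process, which this
report does not state.

```c
#define _POSIX_C_SOURCE 200809L

#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>   /* getpid(), used in the usage note below */

/* Hypothetical helper: returns true if a process with this pid exists.
 * kill(pid, 0) sends no signal; it only checks existence/permission. */
static bool process_exists(pid_t pid)
{
    if (pid <= 0)
        return false;          /* 0 and negative pids address process groups */
    if (kill(pid, 0) == 0)
        return true;
    return errno == EPERM;     /* exists, but we may not signal it */
}

/* Hypothetical guard: only attempt the standby sync (and therefore the
 * connect call) when the WAL send server process is still present.
 * How wal_send_server_pid is obtained is an assumption of this sketch. */
static bool should_sync_to_standby(pid_t wal_send_server_pid)
{
    return process_exists(wal_send_server_pid);
}
```

For example, process_exists(getpid()) is true for the calling process. Note that a check like
this narrows the window but does not close it: the WAL send server process can still be killed
between the check and the connect attempt, so the connect failure path would still need to avoid
raising an ERROR inside the critical section.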



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
