hawq-dev mailing list archives

From "Amy (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HAWQ-1371) QE process hang in shared input scan
Date Wed, 01 Mar 2017 08:06:45 GMT
Amy created HAWQ-1371:
-------------------------

             Summary: QE process hang in shared input scan
                 Key: HAWQ-1371
                 URL: https://issues.apache.org/jira/browse/HAWQ-1371
             Project: Apache HAWQ
          Issue Type: Bug
          Components: Query Execution
            Reporter: Amy
            Assignee: Lei Chang
             Fix For: backlog


The QE process hangs on one segment node, while the QD and the QEs on the other segment nodes have already terminated.

{code}
on segment test2:
[gpadmin@test2 ~]$ pp
gpadmin   21614  0.0  1.2 788636 407428 ?       Ss   Feb26   1:19 /usr/local/hawq_2_1_0_0/bin/postgres -D /data/pulse-agent-data/HAWQ-main-FeatureTest-opt-YARN/product/segmentdd -p 31100 --silent-mode=true -M segment -i
gpadmin   21615  0.0  0.0 279896  6952 ?        Ss   Feb26   0:08 postgres: port 31100, logger process
gpadmin   21618  0.0  0.0 282128  6980 ?        Ss   Feb26   0:00 postgres: port 31100, stats collector process
gpadmin   21619  0.0  0.0 788636  7280 ?        Ss   Feb26   0:11 postgres: port 31100, writer process
gpadmin   21620  0.0  0.0 788636  7064 ?        Ss   Feb26   0:01 postgres: port 31100, checkpoint process
gpadmin   21621  0.0  0.0 793048 11752 ?        S    Feb26   0:19 postgres: port 31100, segment resource manager
gpadmin   91760  0.0  0.0 861000 16840 ?        TNsl Feb26   0:07 postgres: port 31100, gpadmin parquetola... 10.32.35.141(15250) con558 seg4 cmd2 slice11 MPPEXEC SELECT
gpadmin   91762  0.0  0.0 861064 17116 ?        SNsl Feb26   0:08 postgres: port 31100, gpadmin parquetola... 10.32.35.141(15253) con558 seg5 cmd2 slice11 MPPEXEC SELECT
gpadmin  216648  0.0  0.0 103244   788 pts/0    S+   19:54   0:00 grep postgres
{code}

QE stack trace is:
{code}
(gdb) bt
#0  0x00000032214e1523 in select () from /lib64/libc.so.6
#1  0x000000000069c2fa in shareinput_writer_waitdone (ctxt=0x1dae520, share_id=0, nsharer_xslice=7) at nodeShareInputScan.c:989
#2  0x0000000000695798 in ExecEndMaterial (node=0x1d2eb50) at nodeMaterial.c:512
#3  0x000000000067048d in ExecEndNode (node=0x1d2eb50) at execProcnode.c:1681
#4  0x000000000069c6b5 in ExecEndShareInputScan (node=0x1d2e6f0) at nodeShareInputScan.c:382
#5  0x000000000067042a in ExecEndNode (node=0x1d2e6f0) at execProcnode.c:1674
#6  0x00000000006ac9be in ExecEndSequence (node=0x1d23890) at nodeSequence.c:165
#7  0x00000000006705f0 in ExecEndNode (node=0x1d23890) at execProcnode.c:1583
#8  0x000000000069a0ab in ExecEndResult (node=0x1d214a0) at nodeResult.c:481
#9  0x000000000067060d in ExecEndNode (node=0x1d214a0) at execProcnode.c:1575
#10 0x000000000069a0ab in ExecEndResult (node=0x1d20860) at nodeResult.c:481
#11 0x000000000067060d in ExecEndNode (node=0x1d20860) at execProcnode.c:1575
#12 0x0000000000698fd2 in ExecEndMotion (node=0x1d20320) at nodeMotion.c:1230
#13 0x0000000000670434 in ExecEndNode (node=0x1d20320) at execProcnode.c:1713
#14 0x0000000000669da7 in ExecEndPlan (planstate=0x1d20320, estate=0x1cb6b40) at execMain.c:2896
#15 0x000000000066a311 in ExecutorEnd (queryDesc=0x1cabf20) at execMain.c:1407
#16 0x00000000006195f2 in PortalCleanupHelper (portal=0x1cbcc40) at portalcmds.c:365
#17 PortalCleanup (portal=0x1cbcc40) at portalcmds.c:317
#18 0x0000000000900544 in AtAbort_Portals () at portalmem.c:693
#19 0x00000000004e697f in AbortTransaction () at xact.c:2800
#20 0x00000000004e7565 in AbortCurrentTransaction () at xact.c:3377
#21 0x00000000007ed0fa in PostgresMain (argc=<value optimized out>, argv=<value optimized out>, username=0x1b47f10 "gpadmin") at postgres.c:4630
#22 0x00000000007a05d0 in BackendRun () at postmaster.c:5915
#23 BackendStartup () at postmaster.c:5484
#24 ServerLoop () at postmaster.c:2163
#25 0x00000000007a3399 in PostmasterMain (argc=Unhandled dwarf expression opcode 0xf3
) at postmaster.c:1454
#26 0x00000000004a52e9 in main (argc=9, argv=0x1b0cd10) at main.c:226
(gdb) p CurrentTransactionState->state
$1 = TRANS_ABORT
(gdb) p pctxt->donefd
No symbol "pctxt" in current context.
(gdb) f 1
#1  0x000000000069c2fa in shareinput_writer_waitdone (ctxt=0x1dae520, share_id=0, nsharer_xslice=7) at nodeShareInputScan.c:989
989    	nodeShareInputScan.c: No such file or directory.
       	in nodeShareInputScan.c
(gdb) p pctxt->donefd
$2 = 15
{code}




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
