hawq-dev mailing list archives

From "Ming LI (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HAWQ-1408) PANICs during COPY ... FROM STDIN
Date Fri, 24 Mar 2017 14:17:41 GMT
Ming LI created HAWQ-1408:
-----------------------------

             Summary: PANICs during COPY ... FROM STDIN
                 Key: HAWQ-1408
                 URL: https://issues.apache.org/jira/browse/HAWQ-1408
             Project: Apache HAWQ
          Issue Type: Bug
          Components: Core
            Reporter: Ming LI
            Assignee: Ed Espino
             Fix For: 2.1.0.0-incubating


We found PANICs (and corresponding core dumps). From initial analysis of the logs and the core
dump, the query causing this PANIC is a "COPY ... FROM STDIN". The query does not always
panic.
These queries are executed from Java/Scala code (by one of the IG Spark jobs). The connection
to the DB is managed by a connection pool (commons-dbcp2) and validated on borrow with a "select
1" validation query. IG is using postgresql-9.4-1206-jdbc41 as the Java driver to create those
connections. I believe they should be using the driver from DataDirect, available on PivNet;
however, I haven't found hard evidence pointing to the driver as the root cause.

Below is my initial analysis of the packcore for the master PANIC. Not sure if this helps or
makes sense.

This is the backtrace of the packcore for process 466858:

{code}
(gdb) bt
#0  0x00007fd875f906ab in raise () from /data/logs/52280/packcore-core.postgres.466858/lib64/libpthread.so.0
#1  0x00000000008c0b19 in SafeHandlerForSegvBusIll (postgres_signal_arg=11, processName=<optimized
out>) at elog.c:4519
#2  <signal handler called>
#3  0x000000000053b9c3 in SetSegnoForWrite (existing_segnos=0x4c46ff0, existing_segnos@entry=0x0,
relid=relid@entry=1195061, segment_num=segment_num@entry=6, forNewRel=forNewRel@entry=0 '\000',
keepHash=keepHash@entry=1 '\001') at appendonlywriter.c:1166
#4  0x000000000053c08f in assignPerRelSegno (all_relids=all_relids@entry=0x2b96d68, segment_num=6)
at appendonlywriter.c:1212
#5  0x00000000005f79e8 in DoCopy (stmt=stmt@entry=0x2b2a3d8, queryString=<optimized out>)
at copy.c:1591
#6  0x00000000007ef737 in ProcessUtility (parsetree=parsetree@entry=0x2b2a3d8, queryString=0x2c2f550
"COPY mis_data_ig_client_derived_attributes.client_derived_attributes_src (id, tracking_id,
name, value_string, value_timestamp, value_number, value_boolean, environment, account, channel,
device, feat"...,
    params=0x0, isTopLevel=isTopLevel@entry=1 '\001', dest=dest@entry=0x2b2a7c8, completionTag=completionTag@entry=0x7ffcb5e318e0
"") at utility.c:1076
#7  0x00000000007ea95e in PortalRunUtility (portal=portal@entry=0x2b8eab0, utilityStmt=utilityStmt@entry=0x2b2a3d8,
isTopLevel=isTopLevel@entry=1 '\001', dest=dest@entry=0x2b2a7c8, completionTag=completionTag@entry=0x7ffcb5e318e0
"") at pquery.c:1969
#8  0x00000000007ec13e in PortalRunMulti (portal=portal@entry=0x2b8eab0, isTopLevel=isTopLevel@entry=1
'\001', dest=dest@entry=0x2b2a7c8, altdest=altdest@entry=0x2b2a7c8, completionTag=completionTag@entry=0x7ffcb5e318e0
"") at pquery.c:2079
#9  0x00000000007ede95 in PortalRun (portal=portal@entry=0x2b8eab0, count=count@entry=9223372036854775807,
isTopLevel=isTopLevel@entry=1 '\001', dest=dest@entry=0x2b2a7c8, altdest=altdest@entry=0x2b2a7c8,
completionTag=completionTag@entry=0x7ffcb5e318e0 "") at pquery.c:1596
#10 0x00000000007e5ad9 in exec_simple_query (query_string=query_string@entry=0x2b29100 "COPY
mis_data_ig_client_derived_attributes.client_derived_attributes_src (id, tracking_id, name,
value_string, value_timestamp, value_number, value_boolean, environment, account, channel,
device, feat"...,
    seqServerHost=seqServerHost@entry=0x0, seqServerPort=seqServerPort@entry=-1) at postgres.c:1816
#11 0x00000000007e6cb2 in PostgresMain (argc=<optimized out>, argv=<optimized out>,
argv@entry=0x29d7820, username=0x29d75d0 "mis_ig") at postgres.c:4840
#12 0x0000000000799540 in BackendRun (port=0x29afc50) at postmaster.c:5915
#13 BackendStartup (port=0x29afc50) at postmaster.c:5484
#14 ServerLoop () at postmaster.c:2163
#15 0x000000000079c309 in PostmasterMain (argc=<optimized out>, argv=<optimized out>)
at postmaster.c:1454
#16 0x00000000004a4209 in main (argc=9, argv=0x29af010) at main.c:226
{code}

Jumping into frame 3 and running info locals, we found something odd with the "status" variable:

{code}
(gdb) f 3
#3  0x000000000053b9c3 in SetSegnoForWrite (existing_segnos=0x4c46ff0, existing_segnos@entry=0x0,
relid=relid@entry=1195061, segment_num=segment_num@entry=6, forNewRel=forNewRel@entry=0 '\000',
keepHash=keepHash@entry=1 '\001') at appendonlywriter.c:1166
1166	appendonlywriter.c: No such file or directory.
(gdb) info locals
status = 0x0
[...]
{code}

This panic comes from this piece of code in "appendonlywriter.c":

{code}
for (int i = 0; i < segment_num; i++)
{
    AOSegfileStatus *status = maxSegno4Segment[i];
    status->inuse = true;
    status->xid = CurrentXid;
    existing_segnos = lappend_int(existing_segnos,  status->segno);
}
{code}

So, we are pulling a 0x0 (NULL?!) entry from _maxSegno4Segment_... That's strange, because
earlier in this function we populate this array, and we should not reach this section unless
the _maxSegno4Segment_ array is completely populated.

I'm assuming that the variable "segment_num" (used in the for loop's stop condition) is 6 because
we got (4096 MB, 1.00 CORE) x 6 resources, according to the master log (?)

{code}
2017-02-15 00:12:20.121004 GMT,"mis_ig","ig",p466858,th2037315872,"10.33.188.8","35247",2017-02-14
08:41:58 GMT,30177010,con20836,cmd48711,seg-10000,,,x30177010,sx1,"LOG","00000","ConnID 42180.
Acquired resource from resource manager, (4096 MB, 1.00
0000 CORE) x 6."
{code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
