From "Nathan Kleyn (JIRA)" <>
Subject [jira] [Created] (SPARK-23801) Consistent SIGSEGV after upgrading to Spark v2.3.0
Date Tue, 27 Mar 2018 15:04:00 GMT
Nathan Kleyn created SPARK-23801:

             Summary: Consistent SIGSEGV after upgrading to Spark v2.3.0
                 Key: SPARK-23801
             Project: Spark
          Issue Type: Bug
          Components: Spark Core, SQL
    Affects Versions: 2.3.0
            Reporter: Nathan Kleyn
         Attachments: spark-executor-failure.coredump.log

After upgrading to Spark v2.3.0 from Spark v2.1.1, we are seeing consistent segfaults in a
large Spark job (18 * r3.4xlarge 16-core boxes with 105G of executor memory). I've attached
the full coredump, but here is an excerpt:

# A fatal error has been detected by the Java Runtime Environment:
#  SIGSEGV (0xb) at pc=0x00007f1467427fdc, pid=1315, tid=0x00007f1464f2d700
# JRE version: Java(TM) SE Runtime Environment (8.0_161-b12) (build 1.8.0_161-b12)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.161-b12 mixed mode linux-amd64 )
# Problematic frame:
# V  []  oopDesc* PSPromotionManager::copy_to_survivor_space<false>(oopDesc*)+0x7c
# Core dump written. Default location: /var/lib/mesos/slave/slaves/92f50385-a83b-4f36-b1a3-53d9b8716544-S203/frameworks/92f50385-a83b-4f36-b1a3-53d9b8716544-0095/executors/14/runs/2e6b3a6e-b811-47d1-9393-66301d923b98/spark-2.3.0-bin-hadoop2.7/core
or core.1315
# If you would like to submit a bug report, please visit:
---------------  T H R E A D  ---------------

Current thread (0x00007f146005b000):  GCTaskThread [stack: 0x00007f1464e2d000,0x00007f1464f2e000]

siginfo: si_signo: 11 (SIGSEGV), si_code: 128 (SI_KERNEL), si_addr: 0x0000000000000000

RAX=0x17e907feccbc6d20, RBX=0x00007ef9c035f8c8, RCX=0x00007f1464f2c9f0, RDX=0x0000000000000000
RSP=0x00007f1464f2c1a0, RBP=0x00007f1464f2c210, RSI=0x0000000000000068, RDI=0x00007ef7bc30bda8
R8 =0x00007f1464f2c3d0, R9 =0x0000000000001741, R10=0x00007f1467a52819, R11=0x00007f14671240e0
R12=0x00007f130912c998, R13=0x17e907feccbc6d20, R14=0x0000000000000002, R15=0x000000000000000d
RIP=0x00007f1467427fdc, EFLAGS=0x0000000000010202, CSGSFS=0x002b000000000033, ERR=0x0000000000000000

Top of Stack: (sp=0x00007f1464f2c1a0)
0x00007f1464f2c1a0:   00007f146005b000 0000000000000001
0x00007f1464f2c1b0:   0000000000000004 00007f14600bb640
0x00007f1464f2c1c0:   00007f1464f2c210 00007f14673aeed6
0x00007f1464f2c1d0:   00007f1464f2c2c0 00007f1464f2c250
0x00007f1464f2c1e0:   00007f11bde31b70 00007ef9c035f8c8
0x00007f1464f2c1f0:   00007ef8a80a7060 0000000000001741
0x00007f1464f2c200:   0000000000000002 00000000ffffffff
0x00007f1464f2c210:   00007f1464f2c230 00007f146742b005
0x00007f1464f2c220:   00007ef8a80a7050 0000000000001741
0x00007f1464f2c230:   00007f1464f2c2d0 00007f14673ae9fb
0x00007f1464f2c240:   00007f1467a5d880 00007f14673ad9a0
0x00007f1464f2c250:   00007f1464f2c9f0 00007f1464f2c3d0
0x00007f1464f2c260:   00007f1464f2c3a0 00007f146005b620
0x00007f1464f2c270:   00007ef8b843d7c8 ffff000200000006
0x00007f1464f2c280:   00007f1464f2c340 00007f14600bb640
0x00007f1464f2c290:   17417f1453fb9cec 00007f1453fbffff
0x00007f1464f2c2a0:   00007f1453fb819e 00007f1464f2c3a0
0x00007f1464f2c2b0:   0000000000000001 0000000000000000
0x00007f1464f2c2c0:   00007f1464f2c3d0 00007f1464f2c9d0
0x00007f1464f2c2d0:   00007f1464f2c340 00007f1467025f22
0x00007f1464f2c2e0:   00007f145427cb5c 00007f1464f2c3a0
0x00007f1464f2c2f0:   00007f1464f2c370 00007f146005b000
0x00007f1464f2c300:   00007f1464f2c9f0 00007ef850009800
0x00007f1464f2c310:   00007f1464f2c9f0 00007f1464f2c3a0
0x00007f1464f2c320:   00007f1464f2c3d0 00007f146005b000
0x00007f1464f2c330:   00007f1464f2c9f0 00007ef850009800
0x00007f1464f2c340:   00007f1464f2c9c0 00007f1467508191
0x00007f1464f2c350:   00007ef9c16f7890 00007f1464f2c370
0x00007f1464f2c360:   00007f1464f2c9d0 0000000000000000
0x00007f1464f2c370:   00007ef9c035f8c0 00007f145427cb5c
0x00007f1464f2c380:   00007f145427ba90 00007ef900000000
0x00007f1464f2c390:   0000000000000078 00007ef9c035f8c0 

Instructions: (pc=0x00007f1467427fdc)
0x00007f1467427fbc:   01 0f 85 f5 00 00 00 89 f0 c1 f8 03 41 f6 c5 01
0x00007f1467427fcc:   4c 63 f8 0f 85 04 01 00 00 4c 89 e8 48 83 e0 fd
0x00007f1467427fdc:   48 8b 00 48 c1 e8 03 89 c2 48 8b 05 04 74 5e 00
0x00007f1467427fec:   83 e2 0f 3b 10 0f 82 fd 00 00 00 48 8b 45 a8 4e 

Register to memory mapping:

RAX=0x17e907feccbc6d20 is an unknown value
RBX=0x00007ef9c035f8c8 is pointing into the stack for thread: 0x00007ef850009800
RCX=0x00007f1464f2c9f0 is an unknown value
RDX=0x0000000000000000 is an unknown value
RSP=0x00007f1464f2c1a0 is an unknown value
RBP=0x00007f1464f2c210 is an unknown value
RSI=0x0000000000000068 is an unknown value
RDI=0x00007ef7bc30bda8 is pointing into metadata
R8 =0x00007f1464f2c3d0 is an unknown value
R9 =0x0000000000001741 is an unknown value
R10=0x00007f1467a52819: <offset 0xfc0819> in /usr/lib/jvm/java-8-oracle/jre/lib/amd64/server/
at 0x00007f1466a92000
R11=0x00007f14671240e0: <offset 0x6920e0> in /usr/lib/jvm/java-8-oracle/jre/lib/amd64/server/
at 0x00007f1466a92000
R12=0x00007f130912c998 is an oop

 - klass: 'org/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIteratorForCodegenStage50'
R13=0x17e907feccbc6d20 is an unknown value
R14=0x0000000000000002 is an unknown value
R15=0x000000000000000d is an unknown value

Stack: [0x00007f1464e2d000,0x00007f1464f2e000],  sp=0x00007f1464f2c1a0,  free space=1020k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  []  oopDesc* PSPromotionManager::copy_to_survivor_space<false>(oopDesc*)+0x7c
V  []  PSRootsClosure<false>::do_oop(oopDesc**)+0x35
V  []  OopMapSet::all_do(frame const*, RegisterMap const*, OopClosure*,
void (*)(oopDesc**, oopDesc**), OopClosure*)+0x2fb
V  []  frame::oops_do_internal(OopClosure*, CLDClosure*, CodeBlobClosure*,
RegisterMap*, bool)+0xa2
V  []  JavaThread::oops_do(OopClosure*, CLDClosure*, CodeBlobClosure*)+0x161
V  []  ThreadRootsTask::do_it(GCTaskManager*, unsigned int)+0x6f
V  []  GCTaskThread::run()+0x12f
V  []  java_start(Thread*)+0x108

JavaThread 0x00007ef850009800 (nid = 1558) was being processed
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
J 2336  sun.misc.Unsafe.putLong(Ljava/lang/Object;JJ)V (0 bytes) @ 0x00007f14518c70cc [0x00007f14518c7080+0x4c]
J 20102 C2 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage50.processNext()V
(1030 bytes) @ 0x00007f145427cb5c [0x00007f145427c020+0xb3c]
J 9304 C2 scala.collection.Iterator$$anon$11.hasNext()Z (10 bytes) @ 0x00007f145280da10 [0x00007f145280d460+0x5b0]
J 15346 C2 org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(Lscala/collection/Iterator;)V
(117 bytes) @ 0x00007f145227172c [0x00007f1452271680+0xac]
J 16755 C1 org.apache.spark.scheduler.ShuffleMapTask.runTask(Lorg/apache/spark/TaskContext;)Lorg/apache/spark/scheduler/MapStatus;
(293 bytes) @ 0x00007f14534a1dbc [0x00007f145349f820+0x259c]
J 16754 C1 org.apache.spark.scheduler.ShuffleMapTask.runTask(Lorg/apache/spark/TaskContext;)Ljava/lang/Object;
(6 bytes) @ 0x00007f14536cf5cc [0x00007f14536cf540+0x8c]
J 15858 C1;)Ljava/lang/Object;
(399 bytes) @ 0x00007f1452eccd44 [0x00007f1452eca8a0+0x24a4]
J 16786 C1 org.apache.spark.executor.Executor$ (2984 bytes) @ 0x00007f1453a4c97c
J 18919 C1 java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V
(225 bytes) @ 0x00007f1453fb91cc [0x00007f1453fb81c0+0x100c]
j  java.util.concurrent.ThreadPoolExecutor$
v  ~StubRoutines::call_stub

Unfortunately, this job is so large that it's practically impossible for us to narrow it down
to a reproducible test case. What I can say, though, is that:
 * We are running on Mesos using fine-grained scheduling.
 * We can make it fail every time, consistently.
 * It only happened after we upgraded to v2.3.0.
 * All inputs and options to the job are _exactly_ the same before and after the upgrade.

Please let me know if we can provide any other information!

