spark-issues mailing list archives

From "Yan Zhitao (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-20415) SPARK job hangs while writing DataFrame to HDFS
Date Sun, 07 Oct 2018 13:31:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-20415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16641054#comment-16641054 ]

Yan Zhitao commented on SPARK-20415:
------------------------------------

I have a similar issue, but my thread dump differs slightly. It is as follows:

"Executor task launch worker for task 3806" #54 daemon prio=5 os_prio=0 tid=0x0000000001230800
nid=0x1fc runnable [0x00007fba0e600000]
 java.lang.Thread.State: RUNNABLE
 at java.lang.StringCoding.encode(StringCoding.java:364)
 at java.lang.String.getBytes(String.java:941)
 at org.apache.spark.unsafe.types.UTF8String.fromString(UTF8String.java:109)
 at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
Source)
 at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
 at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
 at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243)
 at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)
 at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)
 at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
 at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)
 at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
 at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
 at org.apache.spark.scheduler.Task.run(Task.scala:99)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)

 

The thread keeps running indefinitely, and the Java process pins one core at 100% CPU without ever completing.
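
Both this dump and the one in the original report show the task spinning inside whole-stage-generated code (the GeneratedClass$GeneratedIterator.processNext frames). One triage step, sketched below under the assumption that the generated code path is implicated, is to rerun the failing job with whole-stage codegen disabled and see whether the symptom changes. This is a diagnostic idea only, not a confirmed fix for this issue:

# Hypothetical triage sketch, not a confirmed fix: disable whole-stage
# codegen so the plan runs through the interpreted path, then rerun the
# failing write and compare behavior.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spark-20415-triage")
    # "spark.sql.codegen.wholeStage" exists in Spark 2.x; setting it to
    # false forces the non-generated execution path for supported operators.
    .config("spark.sql.codegen.wholeStage", "false")
    .getOrCreate()
)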

> SPARK job hangs while writing DataFrame to HDFS
> -----------------------------------------------
>
>                 Key: SPARK-20415
>                 URL: https://issues.apache.org/jira/browse/SPARK-20415
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, YARN
>    Affects Versions: 2.1.0
>         Environment: EMR 5.4.0
>            Reporter: P K
>            Priority: Major
>
> We are in a POC phase with Spark. One of the steps reads compressed JSON files that come from our sources, "explodes" them into tabular format, and then writes them to HDFS. This worked for about three weeks until, a few days ago, the writer started hanging for one particular dataset. I logged in to the worker machines and saw this stack trace:
> "Executor task launch worker-0" #39 daemon prio=5 os_prio=0 tid=0x00007f6210352800 nid=0x4542
runnable [0x00007f61f52b3000]
>    java.lang.Thread.State: RUNNABLE
>     at org.apache.spark.unsafe.Platform.copyMemory(Platform.java:210)
>     at org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.writeToMemory(UnsafeArrayData.java:311)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply6_2$(Unknown
Source)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
Source)
>     at org.apache.spark.sql.execution.GenerateExec$$anonfun$doExecute$1$$anonfun$apply$9.apply(GenerateExec.scala:111)
>     at org.apache.spark.sql.execution.GenerateExec$$anonfun$doExecute$1$$anonfun$apply$9.apply(GenerateExec.scala:109)
>     at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
Source)
>     at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>     at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
>     at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:211)
>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
Source)
>     at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>     at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243)
>     at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)
>     at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)
>     at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
>     at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)
>     at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
>     at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>     at org.apache.spark.scheduler.Task.run(Task.scala:99)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> The last messages ever printed in stderr before the hang are:
> 17/04/18 01:41:14 INFO DAGScheduler: Final stage: ResultStage 4 (save at NativeMethodAccessorImpl.java:0)
> 17/04/18 01:41:14 INFO DAGScheduler: Parents of final stage: List()
> 17/04/18 01:41:14 INFO DAGScheduler: Missing parents: List()
> 17/04/18 01:41:14 INFO DAGScheduler: Submitting ResultStage 4 (MapPartitionsRDD[31] at save at NativeMethodAccessorImpl.java:0), which has no missing parents
> 17/04/18 01:41:14 INFO MemoryStore: Block broadcast_9 stored as values in memory (estimated size 170.5 KB, free 2.2 GB)
> 17/04/18 01:41:14 INFO MemoryStore: Block broadcast_9_piece0 stored as bytes in memory (estimated size 65.2 KB, free 2.2 GB)
> 17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xxx.xx:37907 (size: 65.2 KB, free: 2.2 GB)
> 17/04/18 01:41:14 INFO SparkContext: Created broadcast 9 from broadcast at DAGScheduler.scala:996
> 17/04/18 01:41:14 INFO DAGScheduler: Submitting 9 missing tasks from ResultStage 4 (MapPartitionsRDD[31] at save at NativeMethodAccessorImpl.java:0)
> 17/04/18 01:41:14 INFO YarnScheduler: Adding task set 4.0 with 9 tasks
> 17/04/18 01:41:14 INFO TaskSetManager: Starting task 1.0 in stage 4.0 (TID 72, xxx.xxx.xx.xx.xx, executor 12, partition 1, NODE_LOCAL, 8184 bytes)
> 17/04/18 01:41:14 INFO TaskSetManager: Starting task 0.0 in stage 4.0 (TID 73, xxx.xxx.xx.xx.xx, executor 13, partition 0, NODE_LOCAL, 7967 bytes)
> 17/04/18 01:41:14 INFO TaskSetManager: Starting task 2.0 in stage 4.0 (TID 74, xxx.xxx.xx.xx.xx, executor 14, partition 2, NODE_LOCAL, 8181 bytes)
> 17/04/18 01:41:14 INFO TaskSetManager: Starting task 6.0 in stage 4.0 (TID 75, xxx.xxx.xx.xx.xx, executor 16, partition 6, NODE_LOCAL, 8400 bytes)
> 17/04/18 01:41:14 INFO TaskSetManager: Starting task 7.0 in stage 4.0 (TID 76, xxx.xxx.xx.xx.xx, executor 10, partition 7, NODE_LOCAL, 8398 bytes)
> 17/04/18 01:41:14 INFO TaskSetManager: Starting task 3.0 in stage 4.0 (TID 77, xxx.xxx.xx.xx.xx, executor 11, partition 3, NODE_LOCAL, 8182 bytes)
> 17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xx.xx.xx:46030 (size: 65.2 KB, free: 4.0 GB)
> 17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xx.xx.xx:40494 (size: 65.2 KB, free: 4.0 GB)
> 17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xx.xx.xx:35861 (size: 65.2 KB, free: 4.0 GB)
> 17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xx.xx.xx:34157 (size: 65.2 KB, free: 4.0 GB)
> 17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xx.xx.xx:43202 (size: 65.2 KB, free: 4.0 GB)
> 17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xx.xx.xx:46053 (size: 65.2 KB, free: 4.0 GB)
> 17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on xxx.xxx.xx.xx.xx:46030 (size: 28.0 KB, free: 4.0 GB)
> 17/04/18 01:41:15 INFO ExecutorAllocationManager: Requesting 9 new executors because tasks are backlogged (new desired total will be 9)
> 17/04/18 01:41:15 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on xxx.xxx.xx.xx.xx:34157 (size: 28.0 KB, free: 4.0 GB)
> 17/04/18 01:41:15 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on xxx.xxx.xx.xx.xx:40494 (size: 28.0 KB, free: 4.0 GB)
> 17/04/18 01:41:15 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on xxx.xxx.xx.xx.xx:35861 (size: 28.0 KB, free: 4.0 GB)
> 17/04/18 01:41:15 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on xxx.xxx.xx.xx.xx:46053 (size: 28.0 KB, free: 4.0 GB)
> 17/04/18 01:41:15 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on xxx.xxx.xx.xx.xx:43202 (size: 28.0 KB, free: 4.0 GB)
> 17/04/18 01:41:17 INFO TaskSetManager: Starting task 4.0 in stage 4.0 (TID 78, xxx.xxx.xx.xx.xx, executor 15, partition 4, RACK_LOCAL, 8400 bytes)
> 17/04/18 01:41:17 INFO TaskSetManager: Starting task 5.0 in stage 4.0 (TID 79, xxx.xxx.xx.xx.xx, executor 9, partition 5, RACK_LOCAL, 8400 bytes)
> 17/04/18 01:41:17 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xx.xx.xx:34045 (size: 65.2 KB, free: 4.0 GB)
> 17/04/18 01:41:17 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xx.xx.xx:43887 (size: 65.2 KB, free: 4.0 GB)
> 17/04/18 01:41:18 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on xxx.xxx.xx.xx.xx:34045 (size: 28.0 KB, free: 4.0 GB)
> 17/04/18 01:41:18 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on xxx.xxx.xx.xx.xx:43887 (size: 28.0 KB, free: 4.0 GB)
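
For reference, the processing the reporter describes (read compressed JSON, explode into tabular form, write to HDFS) boils down to roughly the following minimal sketch. The paths, the "events" array column, and the Parquet output format are all hypothetical, since the report itself includes no code:

# Minimal sketch of the reported pipeline; all names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.appName("explode-and-write").getOrCreate()

# spark.read.json infers the compression codec from the file extension
# (e.g. .json.gz), so compressed input needs no extra configuration.
df = spark.read.json("hdfs:///data/incoming/*.json.gz")

# "Explode" the nested array into one row per element (tabular format).
flat = df.withColumn("event", explode(col("events"))).drop("events")

# The reported hang occurs during this write stage (the FileFormatWriter
# frames in both stack traces above).
flat.write.mode("overwrite").parquet("hdfs:///data/output/")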


