spark-issues mailing list archives

From "P K (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-20415) SPARK job hangs while writing DataFrame to HDFS
Date Thu, 20 Apr 2017 17:44:04 GMT

     [ https://issues.apache.org/jira/browse/SPARK-20415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

P K updated SPARK-20415:
------------------------
    Environment: EMR 5.4.0  (was: Latest EMR)

> SPARK job hangs while writing DataFrame to HDFS
> -----------------------------------------------
>
>                 Key: SPARK-20415
>                 URL: https://issues.apache.org/jira/browse/SPARK-20415
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, YARN
>    Affects Versions: 2.1.0
>         Environment: EMR 5.4.0
>            Reporter: P K
>            Priority: Blocker
>
> We are in the POC phase with Spark. One of the steps reads compressed JSON files coming
from our sources, "explodes" them into tabular format, and then writes them to HDFS. This
worked for about three weeks until a few days ago, when, for one particular dataset, the writer
started to hang. I logged in to the worker machines and saw this stack trace:
> "Executor task launch worker-0" #39 daemon prio=5 os_prio=0 tid=0x00007f6210352800 nid=0x4542
runnable [0x00007f61f52b3000]
>    java.lang.Thread.State: RUNNABLE
>     at org.apache.spark.unsafe.Platform.copyMemory(Platform.java:210)
>     at org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.writeToMemory(UnsafeArrayData.java:311)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply6_2$(Unknown
Source)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
Source)
>     at org.apache.spark.sql.execution.GenerateExec$$anonfun$doExecute$1$$anonfun$apply$9.apply(GenerateExec.scala:111)
>     at org.apache.spark.sql.execution.GenerateExec$$anonfun$doExecute$1$$anonfun$apply$9.apply(GenerateExec.scala:109)
>     at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
Source)
>     at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>     at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
>     at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:211)
>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
Source)
>     at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>     at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243)
>     at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)
>     at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)
>     at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
>     at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)
>     at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
>     at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>     at org.apache.spark.scheduler.Task.run(Task.scala:99)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> The last messages ever printed in stderr before the hang are:
> 17/04/18 01:41:14 INFO DAGScheduler: Final stage: ResultStage 4 (save at NativeMethodAccessorImpl.java:0)
> 17/04/18 01:41:14 INFO DAGScheduler: Parents of final stage: List()
> 17/04/18 01:41:14 INFO DAGScheduler: Missing parents: List()
> 17/04/18 01:41:14 INFO DAGScheduler: Submitting ResultStage 4 (MapPartitionsRDD[31] at
save at NativeMethodAccessorImpl.java:0), which has no missing parents
> 17/04/18 01:41:14 INFO MemoryStore: Block broadcast_9 stored as values in memory (estimated
size 170.5 KB, free 2.2 GB)
> 17/04/18 01:41:14 INFO MemoryStore: Block broadcast_9_piece0 stored as bytes in memory
(estimated size 65.2 KB, free 2.2 GB)
> 17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xxx.xx:37907
(size: 65.2 KB, free: 2.2 GB)
> 17/04/18 01:41:14 INFO SparkContext: Created broadcast 9 from broadcast at DAGScheduler.scala:996
> 17/04/18 01:41:14 INFO DAGScheduler: Submitting 9 missing tasks from ResultStage 4 (MapPartitionsRDD[31]
at save at NativeMethodAccessorImpl.java:0)
> 17/04/18 01:41:14 INFO YarnScheduler: Adding task set 4.0 with 9 tasks
> 17/04/18 01:41:14 INFO TaskSetManager: Starting task 1.0 in stage 4.0 (TID 72, xxx.xxx.xx.xx.xx,
executor 12, partition 1, NODE_LOCAL, 8184 bytes)
> 17/04/18 01:41:14 INFO TaskSetManager: Starting task 0.0 in stage 4.0 (TID 73, xxx.xxx.xx.xx.xx,
executor 13, partition 0, NODE_LOCAL, 7967 bytes)
> 17/04/18 01:41:14 INFO TaskSetManager: Starting task 2.0 in stage 4.0 (TID 74, xxx.xxx.xx.xx.xx,
executor 14, partition 2, NODE_LOCAL, 8181 bytes)
> 17/04/18 01:41:14 INFO TaskSetManager: Starting task 6.0 in stage 4.0 (TID 75, xxx.xxx.xx.xx.xx,
executor 16, partition 6, NODE_LOCAL, 8400 bytes)
> 17/04/18 01:41:14 INFO TaskSetManager: Starting task 7.0 in stage 4.0 (TID 76, xxx.xxx.xx.xx.xx,
executor 10, partition 7, NODE_LOCAL, 8398 bytes)
> 17/04/18 01:41:14 INFO TaskSetManager: Starting task 3.0 in stage 4.0 (TID 77, xxx.xxx.xx.xx.xx,
executor 11, partition 3, NODE_LOCAL, 8182 bytes)
> 17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xx.xx.xx:46030
(size: 65.2 KB, free: 4.0 GB)
> 17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xx.xx.xx:40494
(size: 65.2 KB, free: 4.0 GB)
> 17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xx.xx.xx:35861
(size: 65.2 KB, free: 4.0 GB)
> 17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xx.xx.xx:34157
(size: 65.2 KB, free: 4.0 GB)
> 17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xx.xx.xx:43202
(size: 65.2 KB, free: 4.0 GB)
> 17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xx.xx.xx:46053
(size: 65.2 KB, free: 4.0 GB)
> 17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on xxx.xxx.xx.xx.xx:46030
(size: 28.0 KB, free: 4.0 GB)
> 17/04/18 01:41:15 INFO ExecutorAllocationManager: Requesting 9 new executors because
tasks are backlogged (new desired total will be 9)
> 17/04/18 01:41:15 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on xxx.xxx.xx.xx.xx:34157
(size: 28.0 KB, free: 4.0 GB)
> 17/04/18 01:41:15 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on xxx.xxx.xx.xx.xx:40494
(size: 28.0 KB, free: 4.0 GB)
> 17/04/18 01:41:15 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on xxx.xxx.xx.xx.xx:35861
(size: 28.0 KB, free: 4.0 GB)
> 17/04/18 01:41:15 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on xxx.xxx.xx.xx.xx:46053
(size: 28.0 KB, free: 4.0 GB)
> 17/04/18 01:41:15 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on xxx.xxx.xx.xx.xx:43202
(size: 28.0 KB, free: 4.0 GB)
> 17/04/18 01:41:17 INFO TaskSetManager: Starting task 4.0 in stage 4.0 (TID 78, xxx.xxx.xx.xx.xx,
executor 15, partition 4, RACK_LOCAL, 8400 bytes)
> 17/04/18 01:41:17 INFO TaskSetManager: Starting task 5.0 in stage 4.0 (TID 79, xxx.xxx.xx.xx.xx,
executor 9, partition 5, RACK_LOCAL, 8400 bytes)
> 17/04/18 01:41:17 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xx.xx.xx:34045
(size: 65.2 KB, free: 4.0 GB)
> 17/04/18 01:41:17 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xx.xx.xx:43887
(size: 65.2 KB, free: 4.0 GB)
> 17/04/18 01:41:18 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on xxx.xxx.xx.xx.xx:34045
(size: 28.0 KB, free: 4.0 GB)
> 17/04/18 01:41:18 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on xxx.xxx.xx.xx.xx:43887
(size: 28.0 KB, free: 4.0 GB)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
