spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Guoqiang Li (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-1712) ParallelCollectionRDD operations hanging forever without any error messages
Date Tue, 06 May 2014 07:47:14 GMT

    [ https://issues.apache.org/jira/browse/SPARK-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990390#comment-13990390
] 

Guoqiang Li commented on SPARK-1712:
------------------------------------

Hi [~pwendell]  
What are your thoughts on this issue?

> ParallelCollectionRDD operations hanging forever without any error messages 
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-1712
>                 URL: https://issues.apache.org/jira/browse/SPARK-1712
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 0.9.0
>         Environment: Linux Ubuntu 14.04, a single spark node; standalone mode.
>            Reporter: Piotr Kołaczkowski
>         Attachments: executor.jstack.txt, master.jstack.txt, repl.jstack.txt, spark-hang.png,
worker.jstack.txt
>
>
> {noformat}
> scala> val collection = (1 to 1000000).map(i => ("foo" + i, i)).toVector
> collection: Vector[(String, Int)] = Vector((foo1,1), (foo2,2), (foo3,3), (foo4,4), (foo5,5),
(foo6,6), (foo7,7), (foo8,8), (foo9,9), (foo10,10), (foo11,11), (foo12,12), (foo13,13), (foo14,14),
(foo15,15), (foo16,16), (foo17,17), (foo18,18), (foo19,19), (foo20,20), (foo21,21), (foo22,22),
(foo23,23), (foo24,24), (foo25,25), (foo26,26), (foo27,27), (foo28,28), (foo29,29), (foo30,30),
(foo31,31), (foo32,32), (foo33,33), (foo34,34), (foo35,35), (foo36,36), (foo37,37), (foo38,38),
(foo39,39), (foo40,40), (foo41,41), (foo42,42), (foo43,43), (foo44,44), (foo45,45), (foo46,46),
(foo47,47), (foo48,48), (foo49,49), (foo50,50), (foo51,51), (foo52,52), (foo53,53), (foo54,54),
(foo55,55), (foo56,56), (foo57,57), (foo58,58), (foo59,59), (foo60,60), (foo61,61), (foo62,62),
(foo63,63), (foo64,64), (foo...
> scala> val rdd = sc.parallelize(collection)
> rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0] at parallelize
at <console>:24
> scala> rdd.first
> res4: (String, Int) = (foo1,1)
> scala> rdd.map(_._2).sum
> // nothing happens
> {noformat}
> CPU and I/O idle. 
> Memory usage reported by JVM, after manually triggered GC:
> repl: 216 MB / 2 GB
> executor: 67 MB / 2 GB
> worker: 6 MB / 128 MB
> master: 6 MB / 128 MB
> No errors found in worker's stderr/stdout. 
> It works fine with 700,000 elements and then it takes about 1 second to process the request
and calculate the sum. With 700,000 items the spark executor memory doesn't even exceed 300
MB out of 2GB available. It fails with 800,000 items.
> Multiple parralelized collections of size 700,000 items at the same time in the same
session work fine.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Mime
View raw message