spark-issues mailing list archives

From "Jamie Hutton (JIRA)" <>
Subject [jira] [Created] (SPARK-15673) Indefinite hanging issue with combination of cache, sort and unionAll
Date Tue, 31 May 2016 16:58:13 GMT
Jamie Hutton created SPARK-15673:

             Summary: Indefinite hanging issue with combination of cache, sort and unionAll
                 Key: SPARK-15673
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 1.6.1, 1.6.0
         Environment: I am running the test code on both a Hortonworks sandbox and also on
            Reporter: Jamie Hutton

I have raised a couple of bugs to do with Spark hanging. One of the previous ones ()
has been resolved in 1.6.1, but the following example is still an issue in 1.6.1.

The code below is a self-contained test case which generates some data and leads to the
hanging behaviour when run via spark-submit in 1.6.0 or 1.6.1. Strangely, the code also hangs
in spark-shell in 1.6.0, but it doesn't seem to in 1.6.1 (hence the main-method test
below). I run it using:

spark-submit --class HangingTest --master local <path-to-compiled-jar>

The hanging doesn't occur if you remove either of the first two cache steps OR the sort steps
(I have added comments to this effect below). We have hit quite a few indefinite hanging issues
with Spark (another is this: ). There seems
to be a rather fundamental issue with chaining steps together and using the cache call.

The bug seems to be confined to reading data out of Hadoop - if we put the data onto a local
drive (using file://) then the hanging stops happening.
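A minimal sketch of that workaround (the helper and object names here are mine, not from the test case): an explicit file:// URI makes Hadoop resolve the path against the local filesystem rather than the cluster's default filesystem.

```scala
object LocalPathWorkaround {
  // Hypothetical helper (not part of the original test case): prefix a bare
  // path with the file:// scheme so the path is resolved against the local
  // filesystem instead of the cluster's default filesystem (typically HDFS).
  def toLocalUri(path: String): String =
    if (path.startsWith("file://")) path else "file://" + path

  def main(args: Array[String]): Unit = {
    // The reads in the test case below would then become, e.g.:
    //"/tmp/df_hanging_test1.parquet"))
    println(toLocalUri("/tmp/df_hanging_test1.parquet")) // file:///tmp/df_hanging_test1.parquet
  }
}
```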

This may seem a rather convoluted test case, but that is mainly because I have stripped the
code back to the simplest possible code that causes the issue.


import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.functions.count
import org.apache.spark.sql.functions.desc

object HangingTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)

    /* Generate some data and write it out so the test is self-contained */
    val r = scala.util.Random
    val list = (0L to 500L).map(i => (i, r.nextInt(500).asInstanceOf[Long]))
    val distData = sc.parallelize(list)
    import sqlContext.implicits._
    val df = distData.toDF("var1", "var2")
    df.write.mode("overwrite").parquet("/tmp/df_hanging_test1.parquet")
    df.write.mode("overwrite").parquet("/tmp/df_hanging_test2.parquet")

    val df1 ="/tmp/df_hanging_test1.parquet")
    val df1_cached = df1.cache() /* Removing this step stops the hanging */
    /* Removing the sort part of this step stops the hanging */
    val df1_grouped ="var1").groupBy("var1")
      .agg(count("var1") as "var1_cnt").sort(desc("var1_cnt"))

    val df2 ="/tmp/df_hanging_test2.parquet")
    val df2_cached = df2.cache() /* Removing this step stops the hanging */
    /* Removing the sort part of this step stops the hanging */
    val df2_grouped ="var2").groupBy("var2")
      .agg(count("var2") as "var2_cnt").sort(desc("var2_cnt"))

    /* This cache step hangs indefinitely */
    val df_combined = df1_grouped.unionAll(df2_grouped).cache()
    /* The .show never happens - it gets stuck on the .cache above */
  }
}

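For reference, the data-generation step runs on its own without Spark, which confirms the input shape (501 rows; the second column is a Long in [0, 500)). A standalone sketch, with an illustrative object name:

```scala
object GenDataCheck {
  def main(args: Array[String]): Unit = {
    // Same generation logic as the test case above, minus Spark.
    val r = scala.util.Random
    val list = (0L to 500L).map(i => (i, r.nextInt(500).asInstanceOf[Long]))
    println(list.size)                                          // 501
    println(list.forall { case (_, v) => 0L <= v && v < 500L }) // true
  }
}
```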
This message was sent by Atlassian JIRA
