spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Takeshi Yamamuro (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-21579) dropTempView has a critical BUG
Date Mon, 31 Jul 2017 13:17:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-21579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16107296#comment-16107296
] 

Takeshi Yamamuro edited comment on SPARK-21579 at 7/31/17 1:16 PM:
-------------------------------------------------------------------

I think this is an expected behaviour in Spark;  `cache1` is a sub-tree of `cache2`, so if
you uncache `cache1`, spark also uncaches `cache2`.
{code}
scala> Seq(("name1", 28)).toDF("name", "age").createOrReplaceTempView("base")
scala> sql("cache table cache1 as select * from base where age >= 25")
scala> sql("cache table cache2 as select * from cache1 where name = 'p1'")
scala> spark.table("cache1").explain
== Physical Plan ==
InMemoryTableScan [name#5, age#6]
   +- InMemoryRelation [name#5, age#6], true, 10000, StorageLevel(disk, memory, deserialized,
1 replicas), `cache1`
         +- *Project [_1#2 AS name#5, _2#3 AS age#6]
            +- *Filter (_2#3 >= 25)
               +- LocalTableScan [_1#2, _2#3]

scala> spark.table("cache2").explain
== Physical Plan ==
InMemoryTableScan [name#5, age#6]
   +- InMemoryRelation [name#5, age#6], true, 10000, StorageLevel(disk, memory, deserialized,
1 replicas), `cache2`
         +- *Filter (isnotnull(name#5) && (name#5 = p1))
            +- InMemoryTableScan [name#5, age#6], [isnotnull(name#5), (name#5 = p1)]
                  +- InMemoryRelation [name#5, age#6], true, 10000, StorageLevel(disk, memory,
deserialized, 1 replicas), `cache1`
                        +- *Project [_1#2 AS name#5, _2#3 AS age#6]
                           +- *Filter (_2#3 >= 25)
                              +- LocalTableScan [_1#2, _2#3]
{code}

Probably, you better do this;
{code}
scala> sql("create temporary view tmp1 as select * from base where age >= 25")
scala> sql("cache table cache1 as select * from tmp1")
scala> sql("cache table cache2 as select * from tmp1 where name = 'p1'")
scala> catalog.dropTempView("cache1")
// scala> sql("select * from cache1").explain --> Throws AnalysisException `Table or
view not found`
scala> sql("select * from cache2").explain
scala> sql("select * from cache2").explain
== Physical Plan ==
InMemoryTableScan [name#5, age#6]
   +- InMemoryRelation [name#5, age#6], true, 10000, StorageLevel(disk, memory, deserialized,
1 replicas), `cache2`
         +- *Project [_1#2 AS name#5, _2#3 AS age#6]
            +- *Filter (((_2#3 >= 25) && isnotnull(_1#2)) && (_1#2 = p1))
               +- LocalTableScan [_1#2, _2#3]
{code}


was (Author: maropu):
I think this is an expected behaviour;  `cache1` is a sub-tree of `cache2`, so if you uncache
`cache1`, spark also uncaches `cache2`.
{code}
scala> Seq(("name1", 28)).toDF("name", "age").createOrReplaceTempView("base")
scala> sql("cache table cache1 as select * from base where age >= 25")
scala> sql("cache table cache2 as select * from cache1 where name = 'p1'")
scala> spark.table("cache1").explain
== Physical Plan ==
InMemoryTableScan [name#5, age#6]
   +- InMemoryRelation [name#5, age#6], true, 10000, StorageLevel(disk, memory, deserialized,
1 replicas), `cache1`
         +- *Project [_1#2 AS name#5, _2#3 AS age#6]
            +- *Filter (_2#3 >= 25)
               +- LocalTableScan [_1#2, _2#3]

scala> spark.table("cache2").explain
== Physical Plan ==
InMemoryTableScan [name#5, age#6]
   +- InMemoryRelation [name#5, age#6], true, 10000, StorageLevel(disk, memory, deserialized,
1 replicas), `cache2`
         +- *Filter (isnotnull(name#5) && (name#5 = p1))
            +- InMemoryTableScan [name#5, age#6], [isnotnull(name#5), (name#5 = p1)]
                  +- InMemoryRelation [name#5, age#6], true, 10000, StorageLevel(disk, memory,
deserialized, 1 replicas), `cache1`
                        +- *Project [_1#2 AS name#5, _2#3 AS age#6]
                           +- *Filter (_2#3 >= 25)
                              +- LocalTableScan [_1#2, _2#3]
{code}

Probably, you better do this;
{code}
scala> sql("create temporary view tmp1 as select * from base where age >= 25")
scala> sql("cache table cache1 as select * from tmp1")
scala> sql("cache table cache2 as select * from tmp1 where name = 'p1'")
scala> catalog.dropTempView("cache1")
// scala> sql("select * from cache1").explain --> Throws AnalysisException `Table or
view not found`
scala> sql("select * from cache2").explain
scala> sql("select * from cache2").explain
== Physical Plan ==
InMemoryTableScan [name#5, age#6]
   +- InMemoryRelation [name#5, age#6], true, 10000, StorageLevel(disk, memory, deserialized,
1 replicas), `cache2`
         +- *Project [_1#2 AS name#5, _2#3 AS age#6]
            +- *Filter (((_2#3 >= 25) && isnotnull(_1#2)) && (_1#2 = p1))
               +- LocalTableScan [_1#2, _2#3]
{code}

> dropTempView has a critical BUG
> -------------------------------
>
>                 Key: SPARK-21579
>                 URL: https://issues.apache.org/jira/browse/SPARK-21579
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.1, 2.2.0
>            Reporter: ant_nebula
>            Priority: Critical
>
> when I dropTempView dwd_table1 only, sub table dwd_table2 also disappear from http://127.0.0.1:4040/storage/.

> It affect version 2.1.1 and 2.2.0, 2.1.0 is ok for this problem.
> {code:java}
> val spark = SparkSession.builder.master("local").appName("sparkTest").getOrCreate()
> val rows = Seq(Row("p1", 30), Row("p2", 20), Row("p3", 25), Row("p4", 10), Row("p5",
40), Row("p6", 15))
> val schema = new StructType().add(StructField("name", StringType)).add(StructField("age",
IntegerType))
> val rowRDD = spark.sparkContext.parallelize(rows, 3)
> val df = spark.createDataFrame(rowRDD, schema)
> df.createOrReplaceTempView("ods_table")
> spark.sql("cache table ods_table")
> spark.sql("cache table dwd_table1 as select * from ods_table where age>=25")
> spark.sql("cache table dwd_table2 as select * from dwd_table1 where name='p1'")
> spark.catalog.dropTempView("dwd_table1")
> //spark.catalog.dropTempView("ods_table")
> spark.sql("select * from dwd_table2").show()
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message