spark-issues mailing list archives

From "Babulal (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SPARK-25332) Sort merge join is selected instead of broadcast hash join after restarting spark-shell/Spark JDBC server for hive-provider tables
Date Tue, 04 Sep 2018 16:15:00 GMT
Babulal created SPARK-25332:
-------------------------------

             Summary: Sort merge join is selected instead of broadcast hash join after restarting
spark-shell/Spark JDBC server for hive-provider tables
                 Key: SPARK-25332
                 URL: https://issues.apache.org/jira/browse/SPARK-25332
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.3.0
            Reporter: Babulal


spark.sql("create table x1(name string,age int) stored as parquet")
spark.sql("insert into x1 select 'a',29")
spark.sql("create table x2(name string,age int) stored as parquet")
spark.sql("insert into x2 select 'a',29")
scala> spark.sql("select * from x1 t1 ,x2 t2 where t1.name=t2.name").explain

== Physical Plan ==
*{color:#14892c}(2) BroadcastHashJoin{color} [name#101], [name#103], Inner, BuildRight
:- *(2) Project [name#101, age#102]
: +- *(2) Filter isnotnull(name#101)
: +- *(2) FileScan parquet default.x1_ex[name#101,age#102] Batched: true, Format: Parquet,
Location: InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x1, PartitionFilters:
[], PushedFilters: [IsNotNull(name)], ReadSchema: struct<name:string,age:int>
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
 +- *(1) Project [name#103, age#104]
 +- *(1) Filter isnotnull(name#103)
 +- *(1) FileScan parquet default.x2_ex[name#103,age#104] Batched: true, Format: Parquet,
Location: InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x2, PartitionFilters:
[], PushedFilters: [IsNotNull(name)], ReadSchema: struct<name:string,age:int>
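For context, the planner picks BroadcastHashJoin here because the estimated size of each side is below spark.sql.autoBroadcastJoinThreshold (10485760 bytes = 10 MB by default). A quick way to inspect the threshold and the optimizer's size estimate from spark-shell (a sketch; the "Statistics" row only appears when statistics are available for the table):

```scala
// Broadcast threshold in bytes; BroadcastHashJoin is only considered when
// one side's estimated size is below this value (default 10485760 = 10 MB).
spark.conf.get("spark.sql.autoBroadcastJoinThreshold")

// The optimizer's size estimate appears in the "Statistics" row of the
// extended table description, when statistics exist.
spark.sql("desc extended x1").show(200, false)
```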
Now restart spark-shell (or the spark-submit application, or the JDBC server) and run the same select query again:

scala> spark.sql("select * from x1 t1 ,x2 t2 where t1.name=t2.name").explain
== Physical Plan ==
*{color:#FF0000}(5) SortMergeJoin{color} [name#43], [name#45], Inner
:- *(2) Sort [name#43 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(name#43, 200)
: +- *(1) Project [name#43, age#44]
: +- *(1) Filter isnotnull(name#43)
: +- *(1) FileScan parquet default.x1[name#43,age#44] Batched: true, Format: Parquet, Location:
InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x1], PartitionFilters:
[], PushedFilters: [IsNotNull(name)], ReadSchema: struct<name:string,age:int>
+- *(4) Sort [name#45 ASC NULLS FIRST], false, 0
 +- Exchange hashpartitioning(name#45, 200)
 +- *(3) Project [name#45, age#46]
 +- *(3) Filter isnotnull(name#45)
 +- *(3) FileScan parquet default.x2[name#45,age#46] Batched: true, Format: Parquet, Location:
InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x2], PartitionFilters:
[], PushedFilters: [IsNotNull(name)], ReadSchema: struct<name:string,age:int>
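Until the regression is fixed, one possible workaround (my suggestion, not part of the original report) is to force the broadcast with a join hint so the planner does not fall back to SortMergeJoin:

```scala
// SQL hint form (available since Spark 2.2): force t2 to be broadcast
// regardless of its size estimate.
spark.sql("select /*+ BROADCAST(t2) */ * from x1 t1, x2 t2 where t1.name = t2.name").explain

// Equivalent DataFrame form:
import org.apache.spark.sql.functions.broadcast
spark.table("x1").join(broadcast(spark.table("x2")), "name").explain
```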
scala> spark.sql("desc formatted x1").show(200,false)
+----------------------------+--------------------------------------------------------------+-------+
|col_name |data_type |comment|
+----------------------------+--------------------------------------------------------------+-------+
|name |string |null |
|age |int |null |
| | | |
|# Detailed Table Information| | |
|Database |default | |
|Table |x1 | |
|Owner |Administrator | |
|Created Time |Sun Aug 19 12:36:58 IST 2018 | |
|Last Access |Thu Jan 01 05:30:00 IST 1970 | |
|Created By |Spark 2.3.0 | |
|Type |MANAGED | |
|Provider |hive | |
|Table Properties |[transient_lastDdlTime=1534662418] | |
|Location |file:/D:/spark_release/spark/bin/spark-warehouse/x1 | |
|Serde Library |org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe | |
|InputFormat |org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat | |
|OutputFormat |org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat| |
|Storage Properties |[serialization.format=1] | |
|Partition Provider |Catalog | |
+----------------------------+--------------------------------------------------------------+-------+
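Note that the Table Properties row above contains only transient_lastDdlTime and no Spark statistics keys, consistent with the size estimate being lost across sessions. Persisting table-level statistics in the metastore may restore the broadcast plan after a restart (an assumption on my part, not verified against this bug):

```scala
// Compute and persist table-level statistics so the optimizer's size
// estimate survives a session restart.
spark.sql("ANALYZE TABLE x1 COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE x2 COMPUTE STATISTICS")
spark.sql("select * from x1 t1, x2 t2 where t1.name = t2.name").explain
```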
With a datasource table it works fine (i.e. create the table USING parquet instead of STORED AS parquet).
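For reference, the datasource-table variant of the same repro (table names y1/y2 are illustrative):

```scala
spark.sql("create table y1(name string, age int) using parquet")
spark.sql("create table y2(name string, age int) using parquet")
spark.sql("insert into y1 select 'a', 29")
spark.sql("insert into y2 select 'a', 29")
// Per the report, after restarting spark-shell this still shows BroadcastHashJoin.
spark.sql("select * from y1 t1, y2 t2 where t1.name = t2.name").explain
```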



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
