spark-issues mailing list archives

From "Yi Zhou (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-5791) [Spark SQL] show poor performance when multiple table do join operation
Date Thu, 05 Mar 2015 07:41:38 GMT

    [ https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14348324#comment-14348324 ]

Yi Zhou edited comment on SPARK-5791 at 3/5/15 7:40 AM:
--------------------------------------------------------

Thank you [~yhuai]. With the parameters below, the updated Spark SQL physical plan shows
greatly improved performance. However, in the latest test results the query is still slow
compared with Hive on M/R (~6 min vs. ~2 min):
spark.sql.shuffle.partitions=200;
spark.sql.autoBroadcastJoinThreshold=209715200;
spark.serializer=org.apache.spark.serializer.KryoSerializer
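
For reference, spark.sql.autoBroadcastJoinThreshold is expressed in bytes, so the value
209715200 above corresponds to 200 MiB. The following is only a minimal Python sketch of that
arithmetic; the names `broadcast_threshold_bytes`, `settings` and `render` are hypothetical
helpers for illustration and are not part of Spark. It rebuilds the three key=value lines
quoted above and does not apply them to any cluster.

```python
# Sketch only: reproduces the three settings quoted in the comment above.
# All helper names here are illustrative assumptions, not Spark APIs.
MIB = 1024 * 1024  # bytes per mebibyte


def broadcast_threshold_bytes(mib):
    """Convert a size in MiB to the byte count the threshold setting expects."""
    return mib * MIB


settings = {
    "spark.sql.shuffle.partitions": "200",
    "spark.sql.autoBroadcastJoinThreshold": str(broadcast_threshold_bytes(200)),
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
}


def render(conf):
    """Render the settings as key=value lines, in insertion order."""
    return [f"{key}={value}" for key, value in conf.items()]


for line in render(settings):
    print(line)
```

Writing the threshold as a MiB count makes it easier to see at a glance which tables
would be small enough to be considered for a broadcast join.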



was (Author: jameszhouyi):
Thank you [~yhuai]. With the parameters below, the updated Spark SQL physical plan shows
greatly improved performance. However, in the latest test results the query is still slow
compared with Hive on M/R (~6 min vs. ~2 min):
spark.sql.shuffle.partitions=200;
spark.sql.autoBroadcastJoinThreshold=209715200;
spark.serializer=org.apache.spark.serializer.KryoSerializer

== Physical Plan ==
InsertIntoHiveTable (MetastoreRelation bigbenchorc, q22_spark_run_query_0_result, None), Map(), false
 Sort [w_warehouse_name#674 ASC,i_item_id#651 ASC], false
  Exchange (HashPartitioning [w_warehouse_name#674,i_item_id#651], 200)
   Filter (((inv_before#635L > 0) && ((CAST(inv_after#636L, DoubleType) / CAST(inv_before#635L, DoubleType)) >= 0.6666666666666666)) && ((CAST(inv_after#636L, DoubleType) / CAST(inv_before#635L, DoubleType)) <= 1.5))
    Aggregate false, [w_warehouse_name#674,i_item_id#651], [w_warehouse_name#674,i_item_id#651,SUM(PartialSum#716L) AS inv_before#635L,SUM(PartialSum#717L) AS inv_after#636L]
     Exchange (HashPartitioning [w_warehouse_name#674,i_item_id#651], 200)
      Aggregate true, [w_warehouse_name#674,i_item_id#651], [w_warehouse_name#674,i_item_id#651,SUM(CAST(CASE WHEN (HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFDateDiff(d_date#688,2001-05-08) < 0) THEN inv_quantity_on_hand#649 ELSE 0, LongType)) AS PartialSum#716L,SUM(CAST(CASE WHEN (HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFDateDiff(d_date#688,2001-05-08) >= 0) THEN inv_quantity_on_hand#649 ELSE 0, LongType)) AS PartialSum#717L]
       Project [w_warehouse_name#674,i_item_id#651,d_date#688,inv_quantity_on_hand#649]
        BroadcastHashJoin [inv_date_sk#646L], [d_date_sk#686L], BuildRight
         Project [i_item_id#651,w_warehouse_name#674,inv_date_sk#646L,inv_quantity_on_hand#649]
          BroadcastHashJoin [inv_warehouse_sk#648L], [w_warehouse_sk#672L], BuildRight
           Project [inv_warehouse_sk#648L,i_item_id#651,inv_date_sk#646L,inv_quantity_on_hand#649]
            BroadcastHashJoin [inv_item_sk#647L], [i_item_sk#650L], BuildRight
             HiveTableScan [inv_date_sk#646L,inv_item_sk#647L,inv_warehouse_sk#648L,inv_quantity_on_hand#649], (MetastoreRelation bigbenchorc, inventory, Some(inv)), None
             Project [i_item_id#651,i_item_sk#650L]
              Filter ((i_current_price#655 > 0.98) && (i_current_price#655 < 1.5))
               HiveTableScan [i_item_id#651,i_item_sk#650L,i_current_price#655], (MetastoreRelation bigbenchorc, item, None), None
           HiveTableScan [w_warehouse_name#674,w_warehouse_sk#672L], (MetastoreRelation bigbenchorc, warehouse, Some(w)), None
         Filter ((HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFDateDiff(d_date#688,2001-05-08) >= -30) && (HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFDateDiff(d_date#688,2001-05-08) <= 30))
          HiveTableScan [d_date_sk#686L,d_date#688], (MetastoreRelation bigbenchorc, date_dim, Some(d)), None
Time taken: 2.579 seconds
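
As a reading aid for the plan above: the topmost Filter keeps a (w_warehouse_name, i_item_id)
group only when inv_before is positive and the ratio inv_after / inv_before lies between
0.6666666666666666 and 1.5 inclusive. The Python sketch below restates that predicate; the
function name `keeps_group` is hypothetical, and the early return for a non-positive
inv_before is an assumption made here to avoid dividing by zero, not a statement about how
Spark SQL evaluates the expression.

```python
# Sketch only: a plain-Python restatement of the top-level Filter in the plan.
# `keeps_group` is an illustrative name, not something taken from the query.
LOWER_BOUND = 0.6666666666666666  # literal as printed in the plan
UPPER_BOUND = 1.5


def keeps_group(inv_before, inv_after):
    """Return True if a group would pass the plan's final Filter."""
    if inv_before <= 0:
        # Assumption: guard first so the division below is always defined.
        return False
    ratio = float(inv_after) / float(inv_before)
    return LOWER_BOUND <= ratio <= UPPER_BOUND


print(keeps_group(100, 120))  # ratio 1.2, inside the range
print(keeps_group(100, 200))  # ratio 2.0, outside the range
```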


> [Spark SQL] show poor performance when multiple table do join operation
> -----------------------------------------------------------------------
>
>                 Key: SPARK-5791
>                 URL: https://issues.apache.org/jira/browse/SPARK-5791
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.2.0
>            Reporter: Yi Zhou
>         Attachments: Physcial_Plan_Hive.txt, Physical_Plan.txt
>
>
> Spark SQL shows poor performance when multiple tables do join operation



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

