spark-issues mailing list archives

From "Yin Huai (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-5791) [Spark SQL] show poor performance when multiple table do join operation
Date Thu, 05 Mar 2015 06:14:38 GMT

    [ https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14348225#comment-14348225
] 

Yin Huai commented on SPARK-5791:
---------------------------------

I see. In Hive's plan, item, warehouse, and date_dim are all broadcast tables. However,
in Spark SQL's plan, the join between item and inventory was a shuffle join. Can you set
spark.sql.autoBroadcastJoinThreshold to a value larger than the size of item? Also, what is
the value of spark.serializer? Setting spark.serializer to org.apache.spark.serializer.KryoSerializer
will also help performance (we will use Kryo to serialize broadcast tables).
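For reference, a minimal sketch of the two settings mentioned above, as they might appear in conf/spark-defaults.conf. The 100 MB threshold here is an illustrative value, not a recommendation; it should be chosen larger than the on-disk size of the item table:

```properties
# Tables smaller than this size (in bytes) are broadcast to all executors
# instead of being shuffled. 104857600 = 100 MB; illustrative only --
# pick a value larger than the `item` table.
spark.sql.autoBroadcastJoinThreshold  104857600

# Use Kryo so broadcast tables are serialized more compactly and faster
# than with the default Java serializer.
spark.serializer  org.apache.spark.serializer.KryoSerializer
```

The same properties can also be passed per job with --conf on spark-submit rather than set globally.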

> [Spark SQL] show poor performance when multiple table do join operation
> -----------------------------------------------------------------------
>
>                 Key: SPARK-5791
>                 URL: https://issues.apache.org/jira/browse/SPARK-5791
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.2.0
>            Reporter: Yi Zhou
>         Attachments: Physcial_Plan_Hive.txt, Physical_Plan.txt
>
>
> Spark SQL shows poor performance when multiple tables are joined



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

