spark-issues mailing list archives

From "Lior Chaga (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-21795) Broadcast hint ignored when dataframe is cached
Date Mon, 21 Aug 2017 05:41:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-21795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lior Chaga updated SPARK-21795:
-------------------------------
    Description: 
Not sure if it's a bug or by design, but if a DF is cached, the broadcast hint is ignored
and Spark uses SortMergeJoin.

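{code}
import org.apache.spark.sql.functions.broadcast

val largeDf = ...
val smallDf = ...
// cache() returns the same Dataset; bind it to a new val instead of reassigning a val
val cachedSmallDf = smallDf.cache()

// the broadcast hint is applied to the cached DF
largeDf.join(broadcast(cachedSmallDf))
{code}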

It makes sense that there's no need to cache a DF when using a broadcast join. However, I wonder
whether it's correct for Spark to ignore the broadcast hint just because the DF is cached.
Consider a case where a DF is cached because several queries use it, and some of those queries
should broadcast it.

If this is the correct behavior, it's at least worth documenting that a cached DF cannot be
broadcast.
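
A minimal sketch of how the planned join strategy could be checked, for anyone trying to reproduce this. Several things here are assumptions and not part of the report: the placeholder data built with spark.range, the join column "id", the local master, the object name BroadcastHintCheck, and the helper usesBroadcastHashJoin (hypothetical, it only searches the plan text). Whether the hint is honored may depend on the Spark version, so no expected output is given.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object BroadcastHintCheck {
  // Hypothetical helper: true if the physical plan text mentions a broadcast hash join.
  def usesBroadcastHashJoin(planText: String): Boolean =
    planText.contains("BroadcastHashJoin")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("broadcast-hint-check")
      .master("local[*]") // assumption: local run for reproduction only
      .getOrCreate()

    // Disable size-based automatic broadcasting so that only the explicit hint can trigger it.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

    // Placeholder data; the report elides how the DataFrames are built.
    val largeDf = spark.range(0L, 1000000L).toDF("id")
    val smallDf = spark.range(0L, 100L).toDF("id")

    val cachedSmallDf = smallDf.cache()
    cachedSmallDf.count() // materialize the cache before planning the join

    // Assumption: an equi-join on "id"; the report does not show a join condition.
    val joined = largeDf.join(broadcast(cachedSmallDf), "id")

    joined.explain() // prints the physical plan for manual inspection
    val planText = joined.queryExecution.executedPlan.toString
    println(s"broadcast hash join planned: ${usesBroadcastHashJoin(planText)}")

    spark.stop()
  }
}
```

Running the same check with the cache() and count() lines removed gives a baseline plan to compare against.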

> Broadcast hint ignored when dataframe is cached
> -----------------------------------------------
>
>                 Key: SPARK-21795
>                 URL: https://issues.apache.org/jira/browse/SPARK-21795
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Lior Chaga
>            Priority: Minor




