spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Liang-Chi Hsieh <vii...@gmail.com>
Subject Re: [SQL] Syntax "case when" doesn't be supported in JOIN
Date Mon, 17 Jul 2017 14:49:35 GMT

IIUC, the evaluation order of rows in Join can be different in different
physical operators, e.g., Sort-based and Hash-based.

But for non-deterministic expressions, different evaluation orders change
results.



Chang Chen wrote
> I see the issue. I will try https://github.com/apache/spark/pull/18652, I
> think
> 
> 1 For Join Operator, the left and right plan can't be non-deterministic.
> 2 If  Filter can support non-deterministic, why not join condition?
> 3 We can't push down or project non-deterministic expression, since it may
> change semantics.
> 
> Actually, the real problem is #2. If the join condition could be
> non-deterministic, then we needn't insert project.
> 
> Thanks
> Chang
> 
> 
> 
> 
> On Mon, Jul 17, 2017 at 3:59 PM, θ’‹ζ˜Ÿεš &lt;

> jiangxb1987@

> &gt; wrote:
> 
>> FYI there have been a related discussion here: https://github.com/apache/
>> spark/pull/15417#discussion_r85295977
>>
>> 2017-07-17 15:44 GMT+08:00 Chang Chen &lt;

> baibaichen@

> &gt;:
>>
>>> Hi All
>>>
>>> I don't understand the difference between the semantics, I found Spark
>>> does the same thing for GroupBy non-deterministic. From Map-Reduce point
>>> of
>>> view, Join is also GroupBy in essence .
>>>
>>> @Liang Chi Hsieh
>>> &lt;https://plus.google.com/u/0/103179362592085650735?prsrc=4&gt;
>>>
>>> in which situation,  semantics  will be changed?
>>>
>>> Thanks
>>> Chang
>>>
>>> On Mon, Jul 17, 2017 at 3:29 PM, Liang-Chi Hsieh &lt;

> viirya@

> &gt;
>>> wrote:
>>>
>>>>
>>>> Thinking about it more, I think it changes the semantics only under
>>>> certain
>>>> scenarios.
>>>>
>>>> For the example SQL query shown in previous discussion, it looks the
>>>> same
>>>> semantics.
>>>>
>>>>
>>>> Xiao Li wrote
>>>> > If the join condition is non-deterministic, pushing it down to the
>>>> > underlying project will change the semantics. Thus, we are unable to
>>>> do it
>>>> > in PullOutNondeterministic. Users can do it manually if they do not
>>>> care
>>>> > the semantics difference.
>>>> >
>>>> > Thanks,
>>>> >
>>>> > Xiao
>>>> >
>>>> >
>>>> >
>>>> > 2017-07-16 20:07 GMT-07:00 Chang Chen &lt;
>>>>
>>>> > baibaichen@
>>>>
>>>> > &gt;:
>>>> >
>>>> >> It is tedious since we have lots of Hive SQL being migrated to
>>>> Spark.
>>>> >> And
>>>> >> this workaround is equivalent  to insert a Project between Join
>>>> operator
>>>> >> and its child.
>>>> >>
>>>> >> Why not do it in PullOutNondeterministic?
>>>> >>
>>>> >> Thanks
>>>> >> Chang
>>>> >>
>>>> >>
>>>> >> On Fri, Jul 14, 2017 at 5:29 PM, Liang-Chi Hsieh &lt;
>>>>
>>>> > viirya@
>>>>
>>>> > &gt; wrote:
>>>> >>
>>>> >>>
>>>> >>> A possible workaround is to add the rand column into tbl1 with
a
>>>> >>> projection
>>>> >>> before the join.
>>>> >>>
>>>> >>> SELECT a.col1
>>>> >>> FROM (
>>>> >>>   SELECT col1,
>>>> >>>     CASE
>>>> >>>          WHEN col2 IS NULL
>>>> >>>            THEN cast(rand(9)*1000 - 9999999999 as string)
>>>> >>>          ELSE
>>>> >>>            col2
>>>> >>>     END AS col2
>>>> >>>     FROM tbl1) a
>>>> >>> LEFT OUTER JOIN tbl2 b
>>>> >>> ON a.col2 = b.col3;
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>> Chang Chen wrote
>>>> >>> > Hi Wenchen
>>>> >>> >
>>>> >>> > Yes. We also find this error is caused by Rand. However,
this is
>>>> >>> classic
>>>> >>> > way to solve data skew in Hive.  Is there any equivalent
way in
>>>> Spark?
>>>> >>> >
>>>> >>> > Thanks
>>>> >>> > Chang
>>>> >>> >
>>>> >>> > On Thu, Jul 13, 2017 at 8:25 PM, Wenchen Fan &lt;
>>>> >>>
>>>> >>> > cloud0fan@
>>>> >>>
>>>> >>> > &gt; wrote:
>>>> >>> >
>>>> >>> >> It’s not about case when, but about rand(). Non-deterministic
>>>> >>> expressions
>>>> >>> >> are not allowed in join condition.
>>>> >>> >>
>>>> >>> >> > On 13 Jul 2017, at 6:43 PM, wangshuang &lt;
>>>> >>>
>>>> >>> > cn_wss@
>>>> >>>
>>>> >>> > &gt; wrote:
>>>> >>> >> >
>>>> >>> >> > I'm trying to execute hive sql on spark sql (Also
on spark
>>>> >>> >> thriftserver), For
>>>> >>> >> > optimizing data skew, we use "case when" to handle
null.
>>>> >>> >> > Simple sql as following:
>>>> >>> >> >
>>>> >>> >> >
>>>> >>> >> > SELECT a.col1
>>>> >>> >> > FROM tbl1 a
>>>> >>> >> > LEFT OUTER JOIN tbl2 b
>>>> >>> >> > ON
>>>> >>> >> > *     CASE
>>>> >>> >> >               WHEN a.col2 IS NULL
>>>> >>> >> >                       TNEN cast(rand(9)*1000 -
9999999999 as
>>>> >>> string)
>>>> >>> >> >               ELSE
>>>> >>> >> >                       a.col2 END *
>>>> >>> >> >       = b.col3;
>>>> >>> >> >
>>>> >>> >> >
>>>> >>> >> > But I get the error:
>>>> >>> >> >
>>>> >>> >> > == Physical Plan ==
>>>> >>> >> > *org.apache.spark.sql.AnalysisException: nondeterministic
>>>> >>> expressions
>>>> >>> >> are
>>>> >>> >> > only allowed in
>>>> >>> >> > Project, Filter, Aggregate or Window, found:*
>>>> >>> >> > (((CASE WHEN (a.`nav_tcdt` IS NULL) THEN CAST(((rand(9)
*
>>>> CAST(1000
>>>> >>> AS
>>>> >>> >> > DOUBLE)) - CAST(9999999999L AS DOUBLE)) AS STRING)
ELSE
>>>> >>> a.`nav_tcdt`
>>>> >>> >> END
>>>> >>> >> =
>>>> >>> >> > c.`site_categ_id`) AND (CAST(a.`nav_tcd` AS INT)
= 9)) AND
>>>> >>> >> (c.`cur_flag`
>>>> >>> >> =
>>>> >>> >> > 1))
>>>> >>> >> > in operator Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25)
>>>> THEN
>>>> >>> >> > cast(((rand(9) * cast(1000 as double)) - cast(9999999999
as
>>>> >>> double))
>>>> >>> as
>>>> >>> >> > string) ELSE nav_tcdt#25 END = site_categ_id#80)
&&
>>>> >>> (cast(nav_tcd#26
>>>> >>> as
>>>> >>> >> int)
>>>> >>> >> > = 9)) && (cur_flag#77 = 1))
>>>> >>> >> >               ;;
>>>> >>> >> > GlobalLimit 10
>>>> >>> >> > +- LocalLimit 10
>>>> >>> >> >   +- Aggregate [date_id#7, CASE WHEN (cast(city_id#10
as
>>>> string) IN
>>>> >>> >> > (cast(19596 as string),cast(20134 as string),cast(10997
as
>>>> string))
>>>> >>> &&
>>>> >>> >> > nav_tcdt#25 RLIKE ^[0-9]+$) THEN city_id#10 ELSE
nav_tpa_id#21
>>>> >>> END],
>>>> >>> >> > [date_id#7]
>>>> >>> >> >      +- Filter (date_id#7 = 2017-07-12)
>>>> >>> >> >         +- Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25)
>>>> THEN
>>>> >>> >> > cast(((rand(9) * cast(1000 as double)) - cast(9999999999
as
>>>> >>> double))
>>>> >>> as
>>>> >>> >> > string) ELSE nav_tcdt#25 END = site_categ_id#80)
&&
>>>> >>> (cast(nav_tcd#26
>>>> >>> as
>>>> >>> >> int)
>>>> >>> >> > = 9)) && (cur_flag#77 = 1))
>>>> >>> >> >            :- SubqueryAlias a
>>>> >>> >> >            :  +- SubqueryAlias tmp_lifan_trfc_tpa_hive
>>>> >>> >> >            :     +- CatalogRelation
>>>> >>> `tmp`.`tmp_lifan_trfc_tpa_hive`,
>>>> >>> >> > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,
>>>> [date_id#7,
>>>> >>> >> chanl_id#8L,
>>>> >>> >> > pltfm_id#9, city_id#10, sessn_id#11, gu_id#12,
>>>> >>> >> nav_refer_page_type_id#13,
>>>> >>> >> > nav_refer_page_value#14, nav_refer_tpa#15,
>>>> nav_refer_tpa_id#16,
>>>> >>> >> > nav_refer_tpc#17, nav_refer_tpi#18, nav_page_type_id#19,
>>>> >>> >> nav_page_value#20,
>>>> >>> >> > nav_tpa_id#21, nav_tpa#22, nav_tpc#23, nav_tpi#24,
>>>> nav_tcdt#25,
>>>> >>> >> nav_tcd#26,
>>>> >>> >> > nav_tci#27, nav_tce#28, detl_refer_page_type_id#29,
>>>> >>> >> > detl_refer_page_value#30, ... 33 more fields]
>>>> >>> >> >            +- SubqueryAlias c
>>>> >>> >> >               +- SubqueryAlias dim_site_categ_ext
>>>> >>> >> >                  +- CatalogRelation `dw`.`dim_site_categ_ext`,
>>>> >>> >> > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,
>>>> >>> >> [site_categ_skid#64L,
>>>> >>> >> > site_categ_type#65, site_categ_code#66, site_categ_name#67,
>>>> >>> >> > site_categ_parnt_skid#68L, site_categ_kywrd#69,
leaf_flg#70L,
>>>> >>> >> sort_seq#71L,
>>>> >>> >> > site_categ_srch_name#72, vsbl_flg#73, delet_flag#74,
>>>> >>> etl_batch_id#75L,
>>>> >>> >> > updt_time#76, cur_flag#77, bkgrnd_categ_skid#78L,
>>>> >>> bkgrnd_categ_id#79L,
>>>> >>> >> > site_categ_id#80, site_categ_parnt_id#81]
>>>> >>> >> >
>>>> >>> >> > Does spark sql not support syntax "case when"
in JOIN?
>>>> Additional,
>>>> >>> my
>>>> >>> >> spark
>>>> >>> >> > version is 2.2.0.
>>>> >>> >> > Any help would be greatly appreciated.
>>>> >>> >> >
>>>> >>> >> >
>>>> >>> >> >
>>>> >>> >> >
>>>> >>> >> > --
>>>> >>> >> > View this message in context: http://apache-spark-developers
>>>> >>> >> -list.1001551.n3.nabble.com/SQL-Syntax-case-when-doesn-t-
>>>> >>> >> be-supported-in-JOIN-tp21953.html
>>>> >>> >> > Sent from the Apache Spark Developers List mailing
list
>>>> archive
>>>> at
>>>> >>> >> Nabble.com.
>>>> >>> >> >
>>>> >>> >> > ------------------------------------------------------------
>>>> >>> ---------
>>>> >>> >> > To unsubscribe e-mail:
>>>> >>>
>>>> >>> > dev-unsubscribe@.apache
>>>> >>>
>>>> >>> >> >
>>>> >>> >>
>>>> >>> >>
>>>> >>> >> ------------------------------------------------------------
>>>> ---------
>>>> >>> >> To unsubscribe e-mail:
>>>> >>>
>>>> >>> > dev-unsubscribe@.apache
>>>> >>>
>>>> >>> >>
>>>> >>> >>
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>> -----
>>>> >>> Liang-Chi Hsieh | @viirya
>>>> >>> Spark Technology Center
>>>> >>> http://www.spark.tc/
>>>> >>> --
>>>> >>> View this message in context: http://apache-spark-developers
>>>> >>> -list.1001551.n3.nabble.com/SQL-Syntax-case-when-doesn-t-be-
>>>> >>> supported-in-JOIN-tp21953p21961.html
>>>> >>> Sent from the Apache Spark Developers List mailing list archive
at
>>>> >>> Nabble.com.
>>>> >>>
>>>> >>> ------------------------------------------------------------
>>>> ---------
>>>> >>> To unsubscribe e-mail:
>>>>
>>>> > dev-unsubscribe@.apache
>>>>
>>>> >>>
>>>> >>>
>>>> >>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> -----
>>>> Liang-Chi Hsieh | @viirya
>>>> Spark Technology Center
>>>> http://www.spark.tc/
>>>> --
>>>> View this message in context: http://apache-spark-developers
>>>> -list.1001551.n3.nabble.com/SQL-Syntax-case-when-doesn-t-be-
>>>> supported-in-JOIN-tp21953p21973.html
>>>> Sent from the Apache Spark Developers List mailing list archive at
>>>> Nabble.com.
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe e-mail: 

> dev-unsubscribe@.apache

>>>>
>>>>
>>>
>>





-----
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 
--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-Syntax-case-when-doesn-t-be-supported-in-JOIN-tp21953p21982.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org


Mime
View raw message