spark-dev mailing list archives

From Liang-Chi Hsieh <vii...@gmail.com>
Subject Re: [SQL] Syntax "case when" doesn't be supported in JOIN
Date Mon, 17 Jul 2017 23:57:36 GMT

Evaluation order does matter. A non-deterministic expression can change its
output due to internal state, which may depend on input order.

MonotonicallyIncreasingID is an example of such a stateful expression: once
you change the row order, the evaluation results are different.
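A minimal, Spark-free sketch of why a stateful expression is order-sensitive (plain Python standing in for Spark's rand(seed); the helper name is made up):

```python
import random

def make_rand_expr(seed):
    # Stands in for Spark's rand(seed): a deterministic stream of numbers,
    # but the value a given row receives depends on how many rows were
    # evaluated before it (internal state).
    rng = random.Random(seed)
    return lambda row: rng.random()

rows = ["r1", "r2", "r3"]

expr = make_rand_expr(9)
forward = {row: expr(row) for row in rows}

expr = make_rand_expr(9)
backward = {row: expr(row) for row in reversed(rows)}

# Same seed, same rows: each order produces the same stream of values,
# but the rows receive them in a different assignment.
assert sorted(forward.values()) == sorted(backward.values())
assert forward["r1"] != backward["r1"]
```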



Chang Chen wrote
> I see.
> 
> Actually, it isn't about evaluation order, which the user can't specify
> anyway. It's about how many times we evaluate the non-deterministic
> expression for the same row.
> 
> For example, given the SQL:
> 
> SELECT a.col1
> FROM tbl1 a
> LEFT OUTER JOIN tbl2 b
> ON
>  CASE WHEN a.col2 IS NULL THEN cast(rand(9)*1000 - 9999999999 as string)
> ELSE a.col2 END
>         =
>  CASE WHEN b.col3 IS NULL THEN cast(rand(9)*1000 - 9999999999 as string)
> ELSE b.col3 END;
> 
> I think if we evaluate the join key exactly once for each row of a and b
> in the whole pipeline, then even if the result isn't deterministic, the
> computation is correct.
> 
> Thanks
> Chang
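Chang's proposal above can be sketched outside Spark (illustrative Python, not Spark code; names like `null_safe_key` are made up): if the null-replacing key is materialized exactly once per input row before the join runs, the result is correct no matter how the join iterates.

```python
import random

def null_safe_key(value, rng):
    # CASE WHEN value IS NULL THEN cast(rand()*1000 - 9999999999 as string)
    # ELSE value END -- NULL rows get a salted key that matches nothing.
    return str(rng.random() * 1000 - 9999999999) if value is None else value

left = [(1, None), (2, "x")]        # rows of a: (col1, col2)
right = [("x", "b1"), ("y", "b2")]  # rows of b: (col3, payload)

# Evaluate the join key exactly once per row, before joining.
rng = random.Random(9)
left_keyed = [(row, null_safe_key(row[1], rng)) for row in left]
right_keyed = [(row, null_safe_key(row[0], rng)) for row in right]

# However the join iterates now, the keys are fixed: the NULL row of
# `left` matches nothing, and (2, "x") matches ("x", "b1").
matches = [(l, r) for l, lk in left_keyed for r, rk in right_keyed if lk == rk]
assert matches == [((2, "x"), ("x", "b1"))]
```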
> 
> 
> On Mon, Jul 17, 2017 at 10:49 PM, Liang-Chi Hsieh <viirya@> wrote:
> 
>>
>> IIUC, the evaluation order of rows in Join can be different in different
>> physical operators, e.g., Sort-based and Hash-based.
>>
>> But for non-deterministic expressions, different evaluation orders change
>> results.
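The point about physical operators can be illustrated without Spark (a hedged Python sketch; "sort-based" and "hash-based" are modeled simply as two visit orders, not real join implementations):

```python
import random

rows = [3, 1, 2]

def eval_nondet_keys(visit_order, seed):
    rng = random.Random(seed)
    # A rand()-based key expression: the value each row receives depends
    # on the order in which the operator visits the rows.
    return {row: rng.random() for row in visit_order}

sort_based = eval_nondet_keys(sorted(rows), seed=9)  # e.g. sort-merge join
hash_based = eval_nondet_keys(rows, seed=9)          # e.g. hash join

# Identical data and seed, yet the two physical strategies disagree on
# which row gets which key.
assert sort_based != hash_based
```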
>>
>>
>>
>> Chang Chen wrote
>> > I see the issue. I will try https://github.com/apache/spark/pull/18652.
>> > I think:
>> >
>> > 1. For the Join operator, the left and right plans can't be
>> > non-deterministic.
>> > 2. If Filter can support non-deterministic expressions, why not the join
>> > condition?
>> > 3. We can't push down or project a non-deterministic expression, since
>> > that may change semantics.
>> >
>> > Actually, the real problem is #2. If the join condition could be
>> > non-deterministic, then we needn't insert a Project.
>> >
>> > Thanks
>> > Chang
>> >
>> >
>> >
>> >
>> > On Mon, Jul 17, 2017 at 3:59 PM, θ’‹ζ˜Ÿεš &lt;
>>
>> > jiangxb1987@
>>
>> > &gt; wrote:
>> >
>> >> FYI there has been a related discussion here:
>> >> https://github.com/apache/spark/pull/15417#discussion_r85295977
>> >>
>> >> 2017-07-17 15:44 GMT+08:00 Chang Chen <baibaichen@>:
>> >>
>> >>> Hi All
>> >>>
>> >>> I don't understand the difference in semantics. I found Spark does the
>> >>> same thing for a non-deterministic GroupBy. From a Map-Reduce point of
>> >>> view, Join is also a GroupBy in essence.
>> >>>
>> >>> @Liang Chi Hsieh <https://plus.google.com/u/0/103179362592085650735?prsrc=4>
>> >>>
>> >>> In which situation will the semantics be changed?
>> >>>
>> >>> Thanks
>> >>> Chang
>> >>>
>> >>> On Mon, Jul 17, 2017 at 3:29 PM, Liang-Chi Hsieh <viirya@> wrote:
>> >>>
>> >>>>
>> >>>> Thinking about it more, I think it changes the semantics only under
>> >>>> certain scenarios.
>> >>>>
>> >>>> For the example SQL query shown in the previous discussion, the
>> >>>> semantics look the same.
>> >>>>
>> >>>>
>> >>>> Xiao Li wrote
>> >>>> > If the join condition is non-deterministic, pushing it down to the
>> >>>> > underlying Project will change the semantics. Thus, we are unable to
>> >>>> > do it in PullOutNondeterministic. Users can do it manually if they
>> >>>> > do not care about the semantics difference.
>> >>>> >
>> >>>> > Thanks,
>> >>>> >
>> >>>> > Xiao
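The semantics difference Xiao describes comes down to evaluation counts. A rough Python illustration (not Spark internals): inside a join condition the expression can be re-evaluated for every row comparison, while in a pushed-down Project it runs once per input row.

```python
import random

left_rows = [None, None]   # two left rows with NULL keys
right_rows = ["k1", "k2"]  # two right rows

# In the join condition: re-evaluated for every (left, right) comparison.
rng = random.Random(9)
draws_in_condition = [rng.random() for _ in left_rows for _ in right_rows]

# Pushed into a Project: evaluated once per left row, then reused.
rng = random.Random(9)
draws_in_project = [rng.random() for _ in left_rows]

# 4 evaluations vs 2: a NULL row can even receive a *different* random key
# in each comparison, which is why the rewrite can change semantics.
assert len(draws_in_condition) == 4
assert len(draws_in_project) == 2
```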
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> > 2017-07-16 20:07 GMT-07:00 Chang Chen <baibaichen@>:
>> >>>> >
>> >>>> >> It is tedious since we have lots of Hive SQL being migrated to
>> >>>> >> Spark. And this workaround is equivalent to inserting a Project
>> >>>> >> between the Join operator and its child.
>> >>>> >>
>> >>>> >> Why not do it in PullOutNondeterministic?
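For readers unfamiliar with the rule under discussion, here is a hypothetical, much-simplified sketch of what a PullOutNondeterministic-style rewrite could do (plain Python dicts standing in for plan nodes; this is not Spark's actual implementation, and the `_nondet_key` name is made up):

```python
def pull_out_nondeterministic(join):
    """Move a non-deterministic join-key expression into a Project
    under the join, so it is evaluated once per input row."""
    cond = join["condition"]
    if not cond["deterministic"]:
        alias = "_nondet_key"  # hypothetical column name
        # 1. Materialize the expression once per left row in a Project.
        join["left"] = {"op": "Project",
                        "exprs": {alias: cond["expr"]},
                        "child": join["left"]}
        # 2. The join condition now just references the new column.
        join["condition"] = {"deterministic": True, "expr": alias}
    return join

plan = {"op": "Join",
        "left": {"op": "Scan", "table": "tbl1"},
        "right": {"op": "Scan", "table": "tbl2"},
        "condition": {"deterministic": False,
                      "expr": "CASE WHEN col2 IS NULL THEN rand() ELSE col2 END"}}

rewritten = pull_out_nondeterministic(plan)
assert rewritten["left"]["op"] == "Project"
assert rewritten["condition"]["deterministic"]
```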
>> >>>> >>
>> >>>> >> Thanks
>> >>>> >> Chang
>> >>>> >>
>> >>>> >>
>> >>>> >> On Fri, Jul 14, 2017 at 5:29 PM, Liang-Chi Hsieh <viirya@> wrote:
>> >>>> >>
>> >>>> >>>
>> >>>> >>> A possible workaround is to add the rand column into tbl1 with a
>> >>>> >>> projection before the join.
>> >>>> >>>
>> >>>> >>> SELECT a.col1
>> >>>> >>> FROM (
>> >>>> >>>   SELECT col1,
>> >>>> >>>     CASE
>> >>>> >>>          WHEN col2 IS NULL
>> >>>> >>>            THEN cast(rand(9)*1000 - 9999999999 as string)
>> >>>> >>>          ELSE
>> >>>> >>>            col2
>> >>>> >>>     END AS col2
>> >>>> >>>     FROM tbl1) a
>> >>>> >>> LEFT OUTER JOIN tbl2 b
>> >>>> >>> ON a.col2 = b.col3;
>> >>>> >>>
>> >>>> >>>
>> >>>> >>>
>> >>>> >>> Chang Chen wrote
>> >>>> >>> > Hi Wenchen
>> >>>> >>> >
>> >>>> >>> > Yes. We also found this error is caused by rand(). However, this
>> >>>> >>> > is a classic way to solve data skew in Hive. Is there any
>> >>>> >>> > equivalent way in Spark?
>> >>>> >>> >
>> >>>> >>> > Thanks
>> >>>> >>> > Chang
>> >>>> >>> >
>> >>>> >>> > On Thu, Jul 13, 2017 at 8:25 PM, Wenchen Fan <cloud0fan@> wrote:
>> >>>> >>> >
>> >>>> >>> >> It’s not about case when, but about rand(). Non-deterministic
>> >>>> >>> >> expressions are not allowed in a join condition.
>> >>>> >>> >>
>> >>>> >>> >> > On 13 Jul 2017, at 6:43 PM, wangshuang <cn_wss@> wrote:
>> >>>> >>> >> >
>> >>>> >>> >> > I'm trying to execute hive sql on spark sql (also on spark
>> >>>> >>> >> > thriftserver). For optimizing data skew, we use "case when"
>> >>>> >>> >> > to handle null. Simple sql as following:
>> >>>> >>> >> >
>> >>>> >>> >> > SELECT a.col1
>> >>>> >>> >> > FROM tbl1 a
>> >>>> >>> >> > LEFT OUTER JOIN tbl2 b
>> >>>> >>> >> > ON
>> >>>> >>> >> >      CASE
>> >>>> >>> >> >            WHEN a.col2 IS NULL
>> >>>> >>> >> >                    THEN cast(rand(9)*1000 - 9999999999 as string)
>> >>>> >>> >> >            ELSE
>> >>>> >>> >> >                    a.col2 END
>> >>>> >>> >> >       = b.col3;
>> >>>> >>> >> >
>> >>>> >>> >> >
>> >>>> >>> >> > But I get the error:
>> >>>> >>> >> >
>> >>>> >>> >> > == Physical Plan ==
>> >>>> >>> >> > org.apache.spark.sql.AnalysisException: nondeterministic expressions are only allowed in Project, Filter, Aggregate or Window, found:
>> >>>> >>> >> > (((CASE WHEN (a.`nav_tcdt` IS NULL) THEN CAST(((rand(9) * CAST(1000 AS DOUBLE)) - CAST(9999999999L AS DOUBLE)) AS STRING) ELSE a.`nav_tcdt` END = c.`site_categ_id`) AND (CAST(a.`nav_tcd` AS INT) = 9)) AND (c.`cur_flag` = 1))
>> >>>> >>> >> > in operator Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25) THEN cast(((rand(9) * cast(1000 as double)) - cast(9999999999 as double)) as string) ELSE nav_tcdt#25 END = site_categ_id#80) && (cast(nav_tcd#26 as int) = 9)) && (cur_flag#77 = 1));;
>> >>>> >>> >> > GlobalLimit 10
>> >>>> >>> >> > +- LocalLimit 10
>> >>>> >>> >> >    +- Aggregate [date_id#7, CASE WHEN (cast(city_id#10 as string) IN (cast(19596 as string),cast(20134 as string),cast(10997 as string)) && nav_tcdt#25 RLIKE ^[0-9]+$) THEN city_id#10 ELSE nav_tpa_id#21 END], [date_id#7]
>> >>>> >>> >> >       +- Filter (date_id#7 = 2017-07-12)
>> >>>> >>> >> >          +- Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25) THEN cast(((rand(9) * cast(1000 as double)) - cast(9999999999 as double)) as string) ELSE nav_tcdt#25 END = site_categ_id#80) && (cast(nav_tcd#26 as int) = 9)) && (cur_flag#77 = 1))
>> >>>> >>> >> >             :- SubqueryAlias a
>> >>>> >>> >> >             :  +- SubqueryAlias tmp_lifan_trfc_tpa_hive
>> >>>> >>> >> >             :     +- CatalogRelation `tmp`.`tmp_lifan_trfc_tpa_hive`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [date_id#7, chanl_id#8L, pltfm_id#9, city_id#10, sessn_id#11, gu_id#12, nav_refer_page_type_id#13, nav_refer_page_value#14, nav_refer_tpa#15, nav_refer_tpa_id#16, nav_refer_tpc#17, nav_refer_tpi#18, nav_page_type_id#19, nav_page_value#20, nav_tpa_id#21, nav_tpa#22, nav_tpc#23, nav_tpi#24, nav_tcdt#25, nav_tcd#26, nav_tci#27, nav_tce#28, detl_refer_page_type_id#29, detl_refer_page_value#30, ... 33 more fields]
>> >>>> >>> >> >             +- SubqueryAlias c
>> >>>> >>> >> >                +- SubqueryAlias dim_site_categ_ext
>> >>>> >>> >> >                   +- CatalogRelation `dw`.`dim_site_categ_ext`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [site_categ_skid#64L, site_categ_type#65, site_categ_code#66, site_categ_name#67, site_categ_parnt_skid#68L, site_categ_kywrd#69, leaf_flg#70L, sort_seq#71L, site_categ_srch_name#72, vsbl_flg#73, delet_flag#74, etl_batch_id#75L, updt_time#76, cur_flag#77, bkgrnd_categ_skid#78L, bkgrnd_categ_id#79L, site_categ_id#80, site_categ_parnt_id#81]
>> >>>> >>> >> > Does spark sql not support the syntax "case when" in JOIN?
>> >>>> >>> >> > Additionally, my spark version is 2.2.0.
>> >>>> >>> >> > Any help would be greatly appreciated.
>> >>>> >>> >> >
>> >>>> >>> >> >
>> >>>> >>> >> >
>> >>>> >>> >> >
>> >>>> >>> >>
>> >>>> >>>
>> >>>> >>>
>> >>>> >>>
>> >>>> >>>
>> >>>> >>>
>> >>>> >>>
>> >>>> >>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>
>> >>
>>
>>
>>
>>
>>





-----
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 
--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-Syntax-case-when-doesn-t-be-supported-in-JOIN-tp21953p21988.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org

