spark-issues mailing list archives

From "Bogdan Raducanu (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-21228) InSet incorrect handling of structs
Date Tue, 27 Jun 2017 13:59:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-21228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bogdan Raducanu updated SPARK-21228:
------------------------------------
    Description: 
In InSet, hset can contain GenericInternalRows while child returns UnsafeRows (or vice versa).
InSet uses hset.contains (in both doCodeGen and eval), which always returns false in this
case, even when the two rows hold the same values.
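
To illustrate why the lookup misses, here is a minimal sketch that does not use Spark at all. GenericRow and BinaryRow are hypothetical stand-ins for GenericInternalRow and UnsafeRow. The sketch assumes that each class only considers itself equal to instances of the same class, so a hash-set lookup across classes fails even when the field values match.

```scala
// Hypothetical stand-ins for the two row representations; these are NOT Spark classes.
sealed trait Row { def values: Seq[Long] }

// Models a "safe" row: equal only to another GenericRow with the same values.
final class GenericRow(val values: Seq[Long]) extends Row {
  override def equals(other: Any): Boolean = other match {
    case g: GenericRow => g.values == values
    case _             => false
  }
  override def hashCode(): Int = values.hashCode()
}

// Models an "unsafe" row: equal only to another BinaryRow with the same values.
final class BinaryRow(val values: Seq[Long]) extends Row {
  override def equals(other: Any): Boolean = other match {
    case b: BinaryRow => b.values == values
    case _            => false
  }
  // Deliberately a different hash function from GenericRow.
  override def hashCode(): Int = 31 * values.sum.toInt + 17
}

object InSetMismatchDemo {
  // The set is built from literals, all in one representation.
  val hset: Set[Any] = Set(
    new GenericRow(Seq(1L, 1L)),
    new GenericRow(Seq(2L, 2L)),
    new GenericRow(Seq(3L, 3L)))

  // The value being probed arrives in the other representation.
  val childValue: Row = new BinaryRow(Seq(1L, 1L))

  // Same logical struct, different class: the lookup misses.
  val crossClassLookup: Boolean = hset.contains(childValue)

  // Same logical struct, same class: the lookup hits.
  val sameClassLookup: Boolean = hset.contains(new GenericRow(Seq(1L, 1L)))
}
```

In this model crossClassLookup is false and sameClassLookup is true, which mirrors the empty result in the reproduction.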

The following code reproduces the problem:
{code}
// The default is 10, which requires a longer query text to reproduce.
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "2")

spark.range(1, 10).selectExpr("named_struct('a', id, 'b', id) as a").createOrReplaceTempView("A")

// The Aggregate here returns UnsafeRows, while the list of structs that becomes hset
// holds GenericInternalRows.
sql("select * from (select min(a) as minA from A) A where minA in (" +
  "named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 2L),named_struct('a', 3L, 'b', 3L))").show
+----+
|minA|
+----+
+----+
{code}

In.doCodeGen uses compareStructs and appears to work. In.eval might also be affected, but I
am not sure how to reproduce that.

{code}
// Now it will not use InSet.
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "3")
sql("select * from (select min(a) as minA from A) A where minA in (" +
  "named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 2L),named_struct('a', 3L, 'b', 3L))").show

+-----+
| minA|
+-----+
|[1,1]|
+-----+
{code}

A solution could be either to do safe<->unsafe conversion in InSet or to not trigger the InSet
optimization at all in this case.
It still needs to be investigated whether In.eval is affected.
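
The first option, converting both sides to one representation before the lookup, can be sketched without Spark as follows. This is only an illustration of the idea, not the actual fix: SafeRow, PackedRow, NormalizedInSet and toCanonical are all hypothetical names, and the canonical form chosen here (a Vector of longs) is an assumption made for the example.

```scala
import java.nio.ByteBuffer

// Hypothetical stand-ins for two physical encodings of the same logical struct.
sealed trait RowRepr { def toCanonical: Vector[Long] }

// Fields held as plain values.
final class SafeRow(fields: Array[Long]) extends RowRepr {
  def toCanonical: Vector[Long] = fields.toVector
}

// Fields packed as consecutive 8-byte big-endian longs.
final class PackedRow(bytes: Array[Byte]) extends RowRepr {
  def toCanonical: Vector[Long] = {
    val buf = ByteBuffer.wrap(bytes)
    Vector.fill(bytes.length / 8)(buf.getLong())
  }
}

object PackedRow {
  // Helper to build a packed row from field values.
  def of(fields: Long*): PackedRow = {
    val buf = ByteBuffer.allocate(8 * fields.length)
    fields.foreach(v => buf.putLong(v))
    new PackedRow(buf.array())
  }
}

// Membership test that normalises both the literals and the probed value
// to the canonical form, so the physical encoding no longer matters.
final class NormalizedInSet(literals: Seq[RowRepr]) {
  private val keys: Set[Vector[Long]] = literals.map(_.toCanonical).toSet
  def contains(value: RowRepr): Boolean = keys.contains(value.toCanonical)
}

object NormalizedInSetDemo {
  val inSet = new NormalizedInSet(Seq(
    new SafeRow(Array(1L, 1L)),
    new SafeRow(Array(2L, 2L)),
    new SafeRow(Array(3L, 3L))))

  // Probe with the other encoding: matches on value, not on class.
  val hit: Boolean = inSet.contains(PackedRow.of(1L, 1L))
  val miss: Boolean = inSet.contains(PackedRow.of(9L, 9L))
}
```

The trade-off in this sketch is one conversion per probed value, which is the cost the second option (not triggering the optimization for structs) would avoid.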


  was:
In InSet it's possible that hset contains GenericInternalRows while child returns UnsafeRows
(and vice versa). InSet.doCodeGen uses hset.contains which will always be false in this case.

The following code reproduces the problem:
{code}
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "2") // the default is 10 which requires a longer query text to repro

spark.range(1, 10).selectExpr("named_struct('a', id, 'b', id) as a").createOrReplaceTempView("A")

sql("select * from (select min(a) as minA from A) A where minA in (named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 2L),named_struct('a', 3L, 'b', 3L))").show // the Aggregate here will return UnsafeRows while the list of structs that will become hset will be GenericInternalRows
+----+
|minA|
+----+
+----+
{code}
In.doCodeGen appears to be correct:
{code}
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "3") // now it will not use InSet
sql("select * from (select min(a) as minA from A) A where minA in (named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 2L),named_struct('a', 3L, 'b', 3L))").show

+-----+
| minA|
+-----+
|[1,1]|
+-----+
{code}

Solution could be either to do safe<->unsafe conversion in InSet.doCodeGen or not trigger
InSet optimization at all in this case.



> InSet incorrect handling of structs
> -----------------------------------
>
>                 Key: SPARK-21228
>                 URL: https://issues.apache.org/jira/browse/SPARK-21228
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Bogdan Raducanu



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

