spark-issues mailing list archives

From "caoxuewen (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-21520) HiveTableScan reads all columns when the SQL statement contains 'rand'
Date Mon, 24 Jul 2017 10:35:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-21520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

caoxuewen updated SPARK-21520:
------------------------------
    Description: 
Currently, when the rand function appears in a SQL statement, the HiveTableScan reads all
columns of the table rather than only the referenced ones.
e.g:
select k,k,sum(id) from (select d004 as id, floor(rand() * 10000) as k, ceil(c010) as cceila
from XXX_table) a
group by k,k;

This generates the following WholeStageCodegen subtrees:
== Subtree 1 / 2 ==
*HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as bigint))], output=[k#403L,
sum#800L])
+- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 10000.0)) AS k#403L]
   +- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, d023#611, d024#612,
c005#613L, c008#614, c009#615, c010#616, d021#617, d022#618, c017#619, c018#620, c019#621,
c020#622, c021#623, c022#624, c023#625, c024#626, c025#627, c026#628, c027#629, ... 169 more
fields], MetastoreRelation XXX_database, XXX_table
== Subtree 2 / 2 ==
*HashAggregate(keys=[k#403L], functions=[sum(cast(id#402 as bigint))], output=[k#403L, k#403L,
sum(id)#797L])
+- Exchange hashpartitioning(k#403L, 200)
   +- *HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as bigint))], output=[k#403L,
sum#800L])
      +- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 10000.0)) AS k#403L]
         +- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, d023#611, d024#612,
c005#613L, c008#614, c009#615, c010#616, d021#617, d022#618, c017#619, c018#620, c019#621,
c020#622, c021#623, c022#624, c023#625, c024#626, c025#627, c026#628, c027#629, ... 169 more
fields], MetastoreRelation XXX_database, XXX_table
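The behavior in the plan above can be modeled with a small sketch (hypothetical, simplified names, not Catalyst's actual API): column pruning only keeps the referenced columns when every projected expression is deterministic, so a non-deterministic rand() forces the scan to keep all columns.

```python
# Toy model of Catalyst-style column pruning (illustrative only).
from dataclasses import dataclass

@dataclass
class Expr:
    references: frozenset       # table columns the expression reads
    deterministic: bool = True  # rand() is the non-deterministic case

def scan_columns(table_columns, projects):
    """Columns a table scan must read for the projected expressions.

    As described in this issue: if any projected expression is
    non-deterministic, the projection is not pushed down, so the
    scan reads every column of the table.
    """
    if all(e.deterministic for e in projects):
        needed = set().union(*(e.references for e in projects))
        return [c for c in table_columns if c in needed]
    return list(table_columns)

table = ["d004", "c010", "c030", "d005"]   # a few of the 192 columns
with_rand = [Expr(frozenset({"d004"})),
             Expr(frozenset(), deterministic=False)]  # floor(rand()*10000)

# With rand() present, every column is scanned:
print(scan_columns(table, with_rand))  # ['d004', 'c010', 'c030', 'd005']

# If rand were treated as deterministic (the change proposed in this
# issue), only the referenced columns would be read:
pruned = [Expr(frozenset({"d004"})), Expr(frozenset())]
print(scan_columns(table, pruned))  # ['d004']
```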
		 
All columns appear in the HiveTableScan; consequently, all column data is read from the
ORC table.
e.g:
INFO ReaderImpl: Reading ORC rows from hdfs://opena:8020/.../XXX_table/.../p_date=2017-05-25/p_hour=10/part-00009
with {include: [true, true, true, true, true, true, true, true, true, true, true, true, true,
true, true, true, true, true, true, true, true, true, true, true, true, true, true, true,
true, true, true, true, true, true, true, true, true, true, true, true, true, true, true,
true, true, true, true, true, true, true, true, true, true, true, true, true, true, true,
true, true, true, true, true, true, true, true, true, true, true, true, true, true, true,
true, true, true, true, true, true, true, true, true, true, true, true, true, true, true,
true, true, true, true, true, true, true, true, true, true, true, true, true, true, true,
true, true, true, true, true, true, true, true, true, true, true, true, true, true, true,
true, true, true], offset: 0, length: 9223372036854775807}
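The `include` array in the log above carries one flag per field in the ORC schema (with the root struct at index 0), where `true` means "read this column". A rough sketch of how such an array is populated (illustrative, not the actual ORC reader code):

```python
# Sketch of an ORC reader's `include` flag array (illustrative).
# Index 0 is the root struct; indexes 1..n are the table's columns.
def build_include(num_columns, selected=None):
    if selected is None:                  # no pruning: read everything
        return [True] * (num_columns + 1)
    include = [False] * (num_columns + 1)
    include[0] = True                     # root struct is always read
    for col_id in selected:               # 1-based column ids
        include[col_id] = True
    return include

# With no column pruning, as in this issue, every flag is true:
flags = build_include(192)
print(all(flags), len(flags))  # True 193
```

Because nothing is pruned, every entry in the logged array is `true`, which is why the reader fetches the full row data for all 192 columns.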

As a result, execution of the SQL statement becomes very slow.


  was:
(previous description: identical to the above, except it also included the solution below)

solution:
Set the rand expression's property deterministic = true


> HiveTableScan reads all columns when the SQL statement contains 'rand'
> -----------------------------------------------------------------------
>
>                 Key: SPARK-21520
>                 URL: https://issues.apache.org/jira/browse/SPARK-21520
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: caoxuewen
>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

