spark-reviews mailing list archives

From heary-cao <...@git.apache.org>
Subject [GitHub] spark pull request #18725: Hivetable scan for all the columns the SQL statem...
Date Mon, 24 Jul 2017 10:29:28 GMT
GitHub user heary-cao opened a pull request:

    https://github.com/apache/spark/pull/18725

    Hivetable scan for all the columns the SQL statement contains the 'rand'

    ## What changes were proposed in this pull request?
    Currently, when the rand function is present in a SQL statement, HiveTableScan reads all columns in the table.
    For example:
    ```
    select k,k,sum(id) from (select d004 as id, floor(rand() * 10000) as k, ceil(c010) as cceila from XXX_table) a
    group by k,k;
    ```
    
    This generates the following WholeStageCodegen subtrees:
    ```
    == Subtree 1 / 2 ==
    *HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as bigint))], output=[k#403L, sum#800L])
    +- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 10000.0)) AS k#403L]
       +- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, c023#625, c024#626, c025#627, c026#628, c027#629, ... 169 more fields], MetastoreRelation XXX_database, XXX_table
    == Subtree 2 / 2 ==
    *HashAggregate(keys=[k#403L], functions=[sum(cast(id#402 as bigint))], output=[k#403L, k#403L, sum(id)#797L])
    +- Exchange hashpartitioning(k#403L, 200)
       +- *HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as bigint))], output=[k#403L, sum#800L])
          +- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 10000.0)) AS k#403L]
             +- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, c023#625, c024#626, c025#627, c026#628, c027#629, ... 169 more fields], MetastoreRelation XXX_database, XXX_table
    ```
    
    Because HiveTableScan requests every column, all column data is read from the ORC table.
    For example:
    `INFO ReaderImpl: Reading ORC rows from hdfs://opena:8020/.../XXX_table/.../p_date=2017-05-25/p_hour=10/part-00009
with {include: [true, true, true, true, true, true, true, true, true, true, true, true, true,
true, true, true, true, true, true, true, true, true, true, true, true, true, true, true,
true, true, true, true, true, true, true, true, true, true, true, true, true, true, true,
true, true, true, true, true, true, true, true, true, true, true, true, true, true, true,
true, true, true, true, true, true, true, true, true, true, true, true, true, true, true,
true, true, true, true, true, true, true, true, true, true, true, true, true, true, true,
true, true, true, true, true, true, true, true, true, true, true, true, true, true, true,
true, true, true, true, true, true, true, true, true, true, true, true, true, true, true,
true, true, true], offset: 0, length: 9223372036854775807}`
    
    As a result, execution of the SQL statement becomes very slow.
    
    Solution:
    Set the deterministic property of the rand expression to true.
    
    
    ## How was this patch tested?
    
    Tested with unit tests.
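
The behaviour described in this pull request can be illustrated with a toy model of column pruning. The sketch below is not Spark code: `Expr`, `Project`, and `columns_to_scan` are hypothetical names, and the rule that pruning is skipped whenever a projected expression is non-deterministic is a simplifying assumption made for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Expr:
    """A projected expression: its alias, the columns it reads, and a deterministic flag."""
    name: str
    references: FrozenSet[str]
    deterministic: bool = True


@dataclass
class Project:
    """A projection sitting directly on top of a table scan (hypothetical model)."""
    exprs: List[Expr]
    table_columns: List[str]


def columns_to_scan(project: Project) -> List[str]:
    """Return the columns the scan would request under the assumed pruning rule.

    Assumption: the scan is pruned to the referenced columns only when every
    projected expression is deterministic; otherwise all columns are read.
    """
    if all(e.deterministic for e in project.exprs):
        needed = set().union(*(e.references for e in project.exprs))
        # Preserve the table's column order in the pruned list.
        return [c for c in project.table_columns if c in needed]
    return list(project.table_columns)


# A four-column stand-in for XXX_table (the real table has far more columns).
table = ["c030", "d004", "d005", "c010"]

id_expr = Expr("id", frozenset({"d004"}))
k_random = Expr("k", frozenset(), deterministic=False)   # rand() as non-deterministic
k_flagged = Expr("k", frozenset(), deterministic=True)   # rand() flagged deterministic

print(columns_to_scan(Project([id_expr, k_random], table)))
print(columns_to_scan(Project([id_expr, k_flagged], table)))
```

With `rand` treated as non-deterministic, the model scans every column, as in the HiveTableScan output above; once the expression is flagged deterministic, as the pull request proposes, only `d004` is requested.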


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/heary-cao/spark rand_deterministic

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18725.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18725
    
----
commit 07ce7d58d3196a593a0329a5e74e5126b1ca9832
Author: caoxuewen <cao.xuewen@zte.com.cn>
Date:   2017-07-24T10:18:19Z

    Hivetable scan for all the columns the SQL statement contains the 'rand'

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

