hive-issues mailing list archives

From "liyunzhang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS
Date Fri, 03 Nov 2017 08:31:00 GMT

    [ https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16237271#comment-16237271 ]

liyunzhang commented on HIVE-17486:
-----------------------------------

[~lirui]:
{quote}
I also think that's possible in theory. But I guess it will require lots of work. E.g. we
may need to modify MapOperator to accommodate the new M->M->R scheme
{quote}
  I am now working on changing the {{M->R}} scheme to {{M->M->R}}. I am not clear
about the modification needed in MapOperator; if you know, please explain in more detail. I think we
first need to change {{GenSparkWork}} to split the physical operator tree when it encounters a TS
that has more than one child. For example, given this
physical plan:
{code}
TS[0]-FIL[52]-SEL[2]-GBY[3]-RS[4]-GBY[5]-RS[42]-JOIN[48]-SEL[49]-LIM[50]-FS[51]
        -FIL[53]-SEL[9]-GBY[10]-RS[11]-GBY[12]-RS[43]-JOIN[48]

{code}
TS\[0\] has two children (FIL\[52\] and FIL\[53\]), so we first split at TS\[0\] and put it in
Map1, then split the following operator trees whenever an RS is encountered. The final operator tree
will be:
{code}
Map1: TS[0]
Map2: FIL[52]-SEL[2]-GBY[3]-RS[4]
Map3: FIL[53]-SEL[9]-GBY[10]-RS[11]
Reducer1: GBY[5]-RS[42]-JOIN[48]-SEL[49]-LIM[50]-FS[51]
Reducer2: GBY[12]-RS[43]
{code}
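The splitting rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual {{GenSparkWork}} code: {{PLAN}}, {{is_cut_point}} and {{split_into_works}} are hypothetical names, the plan is hand-copied from the example above, and it is assumed that only a TS can have more than one child.

```python
# Hypothetical adjacency-list form of the example plan: operator -> children.
PLAN = {
    "TS[0]": ["FIL[52]", "FIL[53]"],
    "FIL[52]": ["SEL[2]"], "SEL[2]": ["GBY[3]"], "GBY[3]": ["RS[4]"],
    "RS[4]": ["GBY[5]"], "GBY[5]": ["RS[42]"], "RS[42]": ["JOIN[48]"],
    "FIL[53]": ["SEL[9]"], "SEL[9]": ["GBY[10]"], "GBY[10]": ["RS[11]"],
    "RS[11]": ["GBY[12]"], "GBY[12]": ["RS[43]"], "RS[43]": ["JOIN[48]"],
    "JOIN[48]": ["SEL[49]"], "SEL[49]": ["LIM[50]"], "LIM[50]": ["FS[51]"],
    "FS[51]": [],
}


def is_cut_point(op, plan):
    """A work ends at `op` if it is an RS, or a TS with more than one child."""
    if op.startswith("RS["):
        return True
    return op.startswith("TS[") and len(plan[op]) > 1


def split_into_works(root, plan):
    """Return a list of operator chains, one per work, in discovery order."""
    works = []
    started = set()   # operators that already begin a work
    pending = [root]  # first operator of each work still to be built
    while pending:
        op = pending.pop(0)
        if op in started:
            continue  # JOIN[48] is reached from both RS[42] and RS[43]
        started.add(op)
        chain = [op]
        # Follow the linear segment until a cut point or a leaf is reached.
        # Assumption: a non-cut operator has at most one child.
        while not is_cut_point(op, plan) and plan[op]:
            op = plan[op][0]
            chain.append(op)
        works.append(chain)
        # The children of a cut point each start a new work.
        if is_cut_point(op, plan):
            pending.extend(plan[op])
    return works


if __name__ == "__main__":
    for i, chain in enumerate(split_into_works("TS[0]", PLAN), start=1):
        print(f"Work{i}: " + "-".join(chain))
```

Applied literally, the rule also cuts after RS\[42\] and RS\[43\], so the sketch returns six chains rather than five: JOIN\[48\]-SEL\[49\]-LIM\[50\]-FS\[51\] becomes a separate work instead of staying in Reducer1.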
This is only initial thinking. If you have any suggestions, please tell me, thanks!

> Enable SharedWorkOptimizer in tez on HOS
> ----------------------------------------
>
>                 Key: HIVE-17486
>                 URL: https://issues.apache.org/jira/browse/HIVE-17486
>             Project: Hive
>          Issue Type: Bug
>            Reporter: liyunzhang
>            Assignee: liyunzhang
>            Priority: Major
>         Attachments: scanshare.after.svg, scanshare.before.svg
>
>
> HIVE-16602: Implement shared scans with Tez.
> Given a query plan, the goal is to identify scans on input tables that can be merged
> so the data is read only once. The optimization is carried out at the physical level.
> Hive on Spark caches the result of a Spark work if that work is used by more than one
> child Spark work. After SharedWorkOptimizer is enabled in the physical plan in HoS, identical
> table scans are merged into one table scan, and the result of that scan is used by more than
> one child Spark work. Thanks to the cache mechanism, the same computation need not be repeated.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
