hive-dev mailing list archives

From "Yin Huai (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those tables which are not used in the child of this conditional task.
Date Wed, 18 Dec 2013 14:10:07 GMT

    [ https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13851756#comment-13851756 ]

Yin Huai commented on HIVE-5945:
--------------------------------

I left two minor comments on the review board.

Two additional comments:
When we find
{code}
bigTableFileAlias != null
{code}
can we also log sumOfOthers and the small-table size threshold? That way the log entry
will show the size of the big table, the total size of the other small tables, and the
threshold for the total size of small tables.
Also, can you add a unit test?
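
To make the logging request concrete, here is a rough sketch of the kind of message meant above. The class name BigTableLogSketch and the method describeChoice are hypothetical and are not part of Hive; the sketch only builds the string, on the assumption that the resolver would pass it to whatever logger it already uses.

```java
// Hypothetical helper, not Hive code: formats the three values the log entry should show.
class BigTableLogSketch {
    static String describeChoice(String bigTableAlias, long bigTableSize,
                                 long sumOfOthers, long smallTableThreshold) {
        return "Chose big table " + bigTableAlias
            + " (size " + bigTableSize + " bytes); total size of small tables = "
            + sumOfOthers + " bytes, small-table threshold = "
            + smallTableThreshold + " bytes";
    }
}
```

A unit test could then assert on the chosen alias and on these three numbers for a small, fixed set of table sizes.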

Thanks :)

> ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those tables which are not used in the child of this conditional task.
> -----------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-5945
>                 URL: https://issues.apache.org/jira/browse/HIVE-5945
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.8.0, 0.9.0, 0.10.0, 0.11.0, 0.12.0, 0.13.0
>            Reporter: Yin Huai
>            Assignee: Navis
>            Priority: Critical
>         Attachments: HIVE-5945.1.patch.txt, HIVE-5945.2.patch.txt, HIVE-5945.3.patch.txt
>
>
> Here is an example
> {code}
> select
>    i_item_id,
>    s_state,
>    avg(ss_quantity) agg1,
>    avg(ss_list_price) agg2,
>    avg(ss_coupon_amt) agg3,
>    avg(ss_sales_price) agg4
> FROM store_sales
> JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
> JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
> JOIN customer_demographics on (store_sales.ss_cdemo_sk = customer_demographics.cd_demo_sk)
> JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
> where
>    cd_gender = 'F' and
>    cd_marital_status = 'U' and
>    cd_education_status = 'Primary' and
>    d_year = 2002 and
>    s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
> group by
>    i_item_id,
>    s_state
> order by
>    i_item_id,
>    s_state
> limit 100;
> {code}
> I turned off noconditionaltask, so I expected 4 Map-only jobs for this query. However,
> I got 1 Map-only job (joining store_sales and date_dim) and 3 MR jobs (for reduce joins).
> So I checked the conditional task that determines the plan of the join involving item. In
> ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask, aliasToFileSizeMap contains all
> input tables used in this query as well as the intermediate table generated by joining
> store_sales and date_dim. So, when we sum the sizes of all small tables, the size of
> store_sales (around 45GB in my test) is also counted.
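
The effect described above can be illustrated with a small standalone sketch. This is not the Hive implementation or the attached patch: the class MapJoinSizeSketch, its methods, the "$INTNAME" placeholder for the intermediate table, and every size except the 45GB of store_sales are assumptions made up for illustration. The sketch picks a big table only among the aliases that participate in one join, and sums only the other participants when comparing against the threshold.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not Hive code.
class MapJoinSizeSketch {
    /**
     * Returns the alias to stream as the big table, or null if no candidate
     * leaves the remaining participants within the small-table threshold.
     */
    static String chooseBigTable(Map<String, Long> aliasToSize,
                                 Set<String> participants,
                                 long smallTableThreshold) {
        String best = null;
        long bestSize = -1L;
        for (String candidate : participants) {
            Long candidateSize = aliasToSize.get(candidate);
            if (candidateSize == null) {
                continue; // size unknown, cannot judge this candidate
            }
            long sumOfOthers = 0L;
            boolean unknown = false;
            for (String other : participants) {
                if (other.equals(candidate)) {
                    continue;
                }
                Long size = aliasToSize.get(other);
                if (size == null) {
                    unknown = true;
                    break;
                }
                sumOfOthers += size;
            }
            if (unknown || sumOfOthers > smallTableThreshold) {
                continue; // the small side would not fit
            }
            if (candidateSize > bestSize) {
                best = candidate;
                bestSize = candidateSize;
            }
        }
        return best;
    }

    /** Invented sizes in bytes; only the 45GB figure comes from the report. */
    static Map<String, Long> exampleSizes() {
        Map<String, Long> sizes = new LinkedHashMap<>();
        sizes.put("store_sales", 45L * 1024 * 1024 * 1024);
        sizes.put("date_dim", 10L * 1024 * 1024);
        sizes.put("item", 5L * 1024 * 1024);
        sizes.put("customer_demographics", 80L * 1024 * 1024);
        sizes.put("store", 1L * 1024 * 1024);
        sizes.put("$INTNAME", 2L * 1024 * 1024 * 1024); // placeholder intermediate table
        return sizes;
    }
}
```

With these invented sizes and an example threshold of 25,000,000 bytes, passing only the two participants of the join with item ("$INTNAME" and "item") returns "$INTNAME", so a map join is possible. Passing every alias in the map, which mimics the behaviour reported here, returns null because store_sales is always counted on the small side.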



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)
