hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Navis (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-4867) Deduplicate columns appearing in both the key list and value list of ReduceSinkOperator
Date Fri, 30 May 2014 00:41:01 GMT

    [ https://issues.apache.org/jira/browse/HIVE-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14013156#comment-14013156
] 

Navis commented on HIVE-4867:
-----------------------------

Yes, there is a problem in mapjoin on tez. MR compiler replaces RS with HashSink made from
value exprs of Join but Tez compiler uses RS as is state assuming it has same columns with
value exprs of Join, which is not true with this patch. Need some more time to fix it.

> Deduplicate columns appearing in both the key list and value list of ReduceSinkOperator
> ---------------------------------------------------------------------------------------
>
>                 Key: HIVE-4867
>                 URL: https://issues.apache.org/jira/browse/HIVE-4867
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Yin Huai
>            Assignee: Navis
>         Attachments: HIVE-4867.1.patch.txt, HIVE-4867.2.patch.txt, source_only.txt
>
>
> A ReduceSinkOperator emits data in the format of keys and values. Right now, a column
may appear in both the key list and value list, which result in unnecessary overhead for shuffling.

> Example:
> We have a query shown below ...
> {code:sql}
> explain select ss_ticket_number from store_sales cluster by ss_ticket_number;
> {\code}
> The plan is ...
> {code}
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 is a root stage
> STAGE PLANS:
>   Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         store_sales 
>           TableScan
>             alias: store_sales
>             Select Operator
>               expressions:
>                     expr: ss_ticket_number
>                     type: int
>               outputColumnNames: _col0
>               Reduce Output Operator
>                 key expressions:
>                       expr: _col0
>                       type: int
>                 sort order: +
>                 Map-reduce partition columns:
>                       expr: _col0
>                       type: int
>                 tag: -1
>                 value expressions:
>                       expr: _col0
>                       type: int
>       Reduce Operator Tree:
>         Extract
>           File Output Operator
>             compressed: false
>             GlobalTableId: 0
>             table:
>                 input format: org.apache.hadoop.mapred.TextInputFormat
>                 output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
> {\code}
> The column 'ss_ticket_number' is in both the key list and value list of the ReduceSinkOperator.
The type of ss_ticket_number is int. For this case, BinarySortableSerDe will introduce 1 byte
more for every int in the key. LazyBinarySerDe will also introduce overhead when recording
the length of a int. For every int, 10 bytes should be a rough estimation of the size of data
emitted from the Map phase. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Mime
View raw message