hive-dev mailing list archives

From "Navis (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-4867) Deduplicate columns appearing in both the key list and value list of ReduceSinkOperator
Date Mon, 19 May 2014 06:27:38 GMT

    [ https://issues.apache.org/jira/browse/HIVE-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001416#comment-14001416 ]

Navis commented on HIVE-4867:
-----------------------------

I think the patch is almost ready, but the diff file cannot be attached here (it is bigger than 10MB).
Most of the change comes from removing duplicated lineage information, so I'm thinking of
fixing that first.

> Deduplicate columns appearing in both the key list and value list of ReduceSinkOperator
> ---------------------------------------------------------------------------------------
>
>                 Key: HIVE-4867
>                 URL: https://issues.apache.org/jira/browse/HIVE-4867
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Yin Huai
>            Assignee: Navis
>         Attachments: HIVE-4867.1.patch.txt, source_only.txt
>
>
> A ReduceSinkOperator emits data in the format of keys and values. Right now, a column
> may appear in both the key list and the value list, which results in unnecessary overhead for shuffling.
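
The idea described above can be illustrated with a minimal sketch. This is not the code from the attached patch; `dedupe_value_columns` is a hypothetical helper, and it assumes columns are identified by plain name strings such as `_col0`. A real implementation inside Hive would also need the reduce side to read the dropped column back from the key, which this sketch does not model.

```python
def dedupe_value_columns(key_cols, value_cols):
    """Return the value columns that are not already present in the key list.

    Hypothetical helper for illustration only: order of the remaining
    value columns is preserved.
    """
    key_set = set(key_cols)
    return [col for col in value_cols if col not in key_set]


if __name__ == "__main__":
    # Mirrors the case in this issue: the same column is both key and value.
    keys = ["_col0"]
    values = ["_col0"]
    print(dedupe_value_columns(keys, values))
```

With the key and value lists from this issue, the value list ends up empty, so only the key copy of the column would be shuffled.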

> Example:
> We have a query shown below ...
> {code:sql}
> explain select ss_ticket_number from store_sales cluster by ss_ticket_number;
> {code}
> The plan is ...
> {code}
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 is a root stage
> STAGE PLANS:
>   Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         store_sales 
>           TableScan
>             alias: store_sales
>             Select Operator
>               expressions:
>                     expr: ss_ticket_number
>                     type: int
>               outputColumnNames: _col0
>               Reduce Output Operator
>                 key expressions:
>                       expr: _col0
>                       type: int
>                 sort order: +
>                 Map-reduce partition columns:
>                       expr: _col0
>                       type: int
>                 tag: -1
>                 value expressions:
>                       expr: _col0
>                       type: int
>       Reduce Operator Tree:
>         Extract
>           File Output Operator
>             compressed: false
>             GlobalTableId: 0
>             table:
>                 input format: org.apache.hadoop.mapred.TextInputFormat
>                 output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
> {code}
> The column 'ss_ticket_number' is in both the key list and the value list of the ReduceSinkOperator.
> The type of ss_ticket_number is int. In this case, BinarySortableSerDe adds 1 extra byte
> for every int in the key. LazyBinarySerDe also adds overhead when recording
> the length of an int. As a rough estimate, about 10 bytes per int are emitted
> from the Map phase.
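
To make the arithmetic in that estimate concrete, here is a back-of-the-envelope sketch. The constants are taken from the issue text (4-byte int, 1 extra byte in the key, roughly 10 bytes per row in total); the size of the value copy is only the implied remainder, which is an assumption and not a measurement. `shuffle_bytes` is a hypothetical helper name.

```python
# Rough shuffle-size model for one int column that is both key and value.
# All constants come from the estimate in the issue text, not from measurements.
INT_BYTES = 4             # raw size of an int
KEY_MARKER_BYTES = 1      # extra byte per int in the key, per the issue text
ROUGH_BYTES_PER_ROW = 10  # the issue's rough total for key copy + value copy

KEY_BYTES = INT_BYTES + KEY_MARKER_BYTES       # 5 bytes for the key copy
VALUE_BYTES = ROUGH_BYTES_PER_ROW - KEY_BYTES  # implied remainder (assumption)


def shuffle_bytes(rows, deduplicated):
    """Estimate bytes emitted by the map phase for `rows` rows."""
    per_row = KEY_BYTES if deduplicated else KEY_BYTES + VALUE_BYTES
    return rows * per_row


if __name__ == "__main__":
    rows = 1_000_000
    print(shuffle_bytes(rows, deduplicated=False))
    print(shuffle_bytes(rows, deduplicated=True))
```

Under these assumptions, dropping the duplicated value copy would roughly halve the map output for this single-column query; the real saving depends on the actual encoded sizes.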



--
This message was sent by Atlassian JIRA
(v6.2#6252)
