hive-dev mailing list archives

From "Pradeep Kamath (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-3733) Improve Hive's logic for conditional merge
Date Fri, 07 Dec 2012 02:39:21 GMT

     [ https://issues.apache.org/jira/browse/HIVE-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pradeep Kamath updated HIVE-3733:
---------------------------------

    Attachment: HIVE-3733.4.patch.txt

Changed the code that inspects the Operator Stack so that it looks for any ReduceSinkOperator
instead of the exact CurrWork.getReducer() instance.
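
To make the difference concrete, here is a rough sketch of the two checks. It is not the
patch itself: the classes and function names below are hypothetical stand-ins written in
Python, and the example stack is an assumed, simplified operator path, not real planner output.

```python
# Toy model only: these classes are hypothetical stand-ins, not Hive's Java classes.
class Operator:
    def __init__(self, name):
        self.name = name


class ReduceSinkOperator(Operator):
    pass


def has_exact_reducer(stack, reducer):
    """Earlier check (as described above): match one specific reducer instance."""
    return any(op is reducer for op in stack)


def has_reduce_sink(stack):
    """Changed check: match any operator of the ReduceSink type on the stack."""
    return any(isinstance(op, ReduceSinkOperator) for op in stack)


# Assumed path from a table scan down to a FileSink, passing through a
# ReduceSink that belongs to an earlier part of the plan.
subquery_rs = ReduceSinkOperator("RS")
stack = [Operator("TS"), Operator("GBY"), subquery_rs, Operator("GBY"),
         Operator("UNION"), Operator("SEL"), Operator("FS")]

# Assumption: the reducer of the work being processed is not on this path.
current_reducer = Operator("reducer of the current work")

print(has_exact_reducer(stack, current_reducer))
print(has_reduce_sink(stack))
```

In this toy case the instance check finds nothing while the type check does, which is the
kind of behaviour change discussed below for union19.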

union19 no longer performs a conditional merge with this change. My hypothesis for this follows:

The union19 query is:
FROM (select 'tst1' as key, cast(count(1) as string) as value from src s1
UNION ALL
select s2.key as key, s2.value as value from src s2) unionsrc
INSERT OVERWRITE TABLE DEST1 SELECT unionsrc.key, count(unionsrc.value) group by unionsrc.key
INSERT OVERWRITE TABLE DEST2 SELECT unionsrc.key, unionsrc.value, unionsrc.value;

The from subquery has an implicit group by/ReduceSink due to the count. So although the second
insert in the multi insert has no groupby/ReduceSink of its own, the subquery in the from
clause causes a groupby/ReduceSink to appear in the stack. Hence we decide not to do the
conditional merge, since the FileSink will be in the reduce phase.
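
A small sketch of this hypothesis, under stated assumptions: the operator paths below are
simplified guesses written with the usual EXPLAIN abbreviations (TS, SEL, GBY, RS, UNION, FS),
and should_merge is a hypothetical helper in Python that models the intended per-FileSink
decision, not Hive's actual code.

```python
def should_merge(stack, merge_mapfiles, merge_mapredfiles):
    """Hypothetical per-FileSink decision.

    stack is the operator path from a root down to the FileSink (last element).
    A ReduceSink anywhere above the FileSink is taken to mean that the
    FileSink runs on the reduce side.
    """
    in_reduce = "RS" in stack[:-1]
    return merge_mapredfiles if in_reduce else merge_mapfiles


# Settings from the issue description:
# hive.merge.mapfiles=true, hive.merge.mapredfiles=false
settings = {"merge_mapfiles": True, "merge_mapredfiles": False}

# Assumed path to the DEST2 FileSink in union19 when reached through the
# count(1) branch of the union: the implicit group by adds an RS to the stack.
union19_dest2 = ["TS", "SEL", "GBY", "RS", "GBY", "SEL", "UNION", "SEL", "FS"]

# Assumed path for a plain "INSERT ... SELECT *" with no subquery above it.
plain_select = ["TS", "SEL", "FS"]

print(should_merge(union19_dest2, **settings))
print(should_merge(plain_select, **settings))
```

Under these assumptions the union19 path is treated as reduce-side and is not merged, while
a FileSink with no ReduceSink above it still is.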
                
> Improve Hive's logic for conditional merge
> ------------------------------------------
>
>                 Key: HIVE-3733
>                 URL: https://issues.apache.org/jira/browse/HIVE-3733
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Pradeep Kamath
>            Assignee: Pradeep Kamath
>         Attachments: HIVE-3733.1.patch.txt, HIVE-3733.3.patch.txt, HIVE-3733.4.patch.txt
>
>
> If the config hive.merge.mapfiles is set to true and hive.merge.mapredfiles is set to
> false, then when Hive encounters a FileSinkOperator while generating map reduce tasks, it
> looks at the entire job to see if it has a reducer; if it does, it will not merge. Instead
> it should check whether the FileSinkOperator is a child of the reducer. This means that
> outputs generated in the mapper will be merged and outputs generated in the reducer will
> not be, which is the intended effect of setting those configs.
> Simple repro:
> set hive.merge.mapfiles=true;
> set hive.merge.mapredfiles=false;
> EXPLAIN
> FROM <input_table>
> INSERT OVERWRITE TABLE <output_table1> SELECT key, COUNT(*) group by key
> INSERT OVERWRITE TABLE <output_table2> SELECT *;
> The output should contain a Conditional Operator, Mapred Stages, and Move tasks

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
