hive-dev mailing list archives

From "Pradeep Kamath (JIRA)" <>
Subject [jira] [Updated] (HIVE-3733) Improve Hive's logic for conditional merge
Date Thu, 13 Dec 2012 18:08:12 GMT


Pradeep Kamath updated HIVE-3733:

    Attachment: HIVE-3733.5.patch.txt

I have attached HIVE-3733.5.patch.txt for review (it is also posted on Differential). It includes
some changes, but essentially implements the fix for this issue at the physical optimizer
level. The code checks whether a non-reduce FileSinkOperator in a MapRedTask (one that is not a
child of a ConditionalTask, so that we don't go after merge tasks) can be conditionally merged,
and uses the code from GenMRFileSink1 to actually introduce the conditional merge.
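
To make the check above concrete, here is a minimal, self-contained sketch of the candidate
selection. It is not the code in the patch: SketchTask, SketchSink and MergeCandidateFinder are
hypothetical stand-ins for Hive's task and operator classes, and the boolean fields are
assumptions about what the real optimizer would inspect.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a Hive task; not the real Task/MapRedTask API.
class SketchTask {
    final String name;
    final boolean isMapRed;            // true if this is a map-reduce task
    final boolean parentIsConditional; // true if the task is a child of a conditional task
    final List<SketchSink> sinks = new ArrayList<>();

    SketchTask(String name, boolean isMapRed, boolean parentIsConditional) {
        this.name = name;
        this.isMapRed = isMapRed;
        this.parentIsConditional = parentIsConditional;
    }
}

// Hypothetical stand-in for a FileSinkOperator.
class SketchSink {
    final String name;
    final boolean inReducer; // true if the sink sits in the reduce side of the task

    SketchSink(String name, boolean inReducer) {
        this.name = name;
        this.inReducer = inReducer;
    }
}

class MergeCandidateFinder {
    // Returns the names of sinks that would be considered for a conditional merge:
    // map-side sinks of map-reduce tasks that are not children of a conditional task.
    static List<String> findCandidates(List<SketchTask> tasks) {
        List<String> candidates = new ArrayList<>();
        for (SketchTask task : tasks) {
            if (!task.isMapRed || task.parentIsConditional) {
                continue; // skip non map-reduce tasks and tasks that are already merge tasks
            }
            for (SketchSink sink : task.sinks) {
                if (!sink.inReducer) {
                    candidates.add(sink.name);
                }
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        SketchTask stage = new SketchTask("stage-1", true, false);
        stage.sinks.add(new SketchSink("mapSink", false));
        stage.sinks.add(new SketchSink("reduceSink", true));

        List<SketchTask> tasks = new ArrayList<>();
        tasks.add(stage);
        System.out.println(findCandidates(tasks));
    }
}
```

In the actual patch the merge itself is introduced by reusing code from GenMRFileSink1; the
sketch only covers which sinks are selected.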

All tests pass except the two below:
testCliDriver_stats19 - This succeeds on my Mac but fails on a Linux machine; I am not sure
what to make of it.
testNegativeCliDriver_stats_aggregator_error_1 - This produces an error during execution. I am
assuming this test case is known to be flaky and that the error is not due to the current changes.

Committers, please review carefully to make sure I haven't missed any corner cases and I have
left the tasks/plan in a valid state.

> Improve Hive's logic for conditional merge
> ------------------------------------------
>                 Key: HIVE-3733
>                 URL:
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Pradeep Kamath
>            Assignee: Pradeep Kamath
>         Attachments: HIVE-3733.1.patch.txt, HIVE-3733.3.patch.txt, HIVE-3733.4.patch.txt,
HIVE-3733.5.patch.txt, HIVE-3733.optimizer.patch.txt
> If the config hive.merge.mapfiles is set to true and hive.merge.mapredfiles is set to
false, then when Hive encounters a FileSinkOperator while generating map-reduce tasks, it
looks at the entire job to see whether it has a reducer; if it does, it will not merge. Instead,
it should check whether the FileSinkOperator is a child of the reducer. That way, outputs
generated in the mapper will be merged and outputs generated in the reducer will not be, which is
the intended effect of setting those configs.
> Simple repro:
> set hive.merge.mapfiles=true;
> set hive.merge.mapredfiles=false;
> FROM <input_table>
> INSERT OVERWRITE TABLE <output_table1> SELECT key, COUNT(*) group by key
> The output should contain a Conditional Operator, Mapred Stages, and Move tasks
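
The difference between the behavior described in the issue and the proposed behavior can be
sketched as two small predicates. This is only an illustration: MergeDecision and its method
names are hypothetical, and it assumes that hive.merge.mapredfiles governs reduce-side output
while hive.merge.mapfiles governs map-side output.

```java
// Hypothetical illustration; not Hive code.
class MergeDecision {
    // Behavior described in the issue: one decision for the whole job,
    // based on whether the job has a reducer anywhere.
    static boolean perJobRule(boolean mergeMapFiles, boolean mergeMapRedFiles,
                              boolean jobHasReducer) {
        return jobHasReducer ? mergeMapRedFiles : mergeMapFiles;
    }

    // Proposed behavior: decide per FileSinkOperator,
    // based on whether that sink is a child of the reducer.
    static boolean perSinkRule(boolean mergeMapFiles, boolean mergeMapRedFiles,
                               boolean sinkUnderReducer) {
        return sinkUnderReducer ? mergeMapRedFiles : mergeMapFiles;
    }

    public static void main(String[] args) {
        boolean mergeMapFiles = true;     // hive.merge.mapfiles=true
        boolean mergeMapRedFiles = false; // hive.merge.mapredfiles=false

        // A map-side sink inside a job that also has a reducer.
        System.out.println("per-job rule merges:  "
                + perJobRule(mergeMapFiles, mergeMapRedFiles, true));
        System.out.println("per-sink rule merges: "
                + perSinkRule(mergeMapFiles, mergeMapRedFiles, false));
    }
}
```

With the settings from the repro, the per-job rule skips the merge for a map-side sink whenever
the job has a reducer, while the per-sink rule merges it and still leaves reduce-side output
unmerged.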

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see:
