hive-issues mailing list archives

From "Sahil Takiar (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-15114) Remove extra MoveTask operators
Date Thu, 03 Nov 2016 06:13:58 GMT

    [ https://issues.apache.org/jira/browse/HIVE-15114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15631757#comment-15631757
] 

Sahil Takiar commented on HIVE-15114:
-------------------------------------

One easy way to fix this would be to just walk the Task Tree after all Conditional Tasks have
been resolved and combine any sequential MoveTasks. This could have a few advantages:

* Change is less invasive of the current logic
* Could benefit other areas where sequential MoveTasks get added to the plan
* Regression proof - if we fix the duplicate MoveTasks now, it's going to be hard to add unit
tests to ensure they don't get added back in - this approach avoids that problem altogether

This assumes that sequential MoveTasks can be combined.
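
A minimal sketch of the idea, written in Python against a hypothetical `Task` model rather than Hive's actual task classes. The `Task` fields, the `combine_sequential_moves` name, and the example paths are all assumptions made for illustration. It also assumes the plan is a tree (each task has a single parent) and that two moves can be combined only when the second reads exactly what the first writes:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Task:
    """Hypothetical stand-in for a plan task; not Hive's real Task class."""
    kind: str                    # e.g. "move" or "mapred"
    src: Optional[str] = None    # source path, only meaningful for moves
    dst: Optional[str] = None    # destination path, only meaningful for moves
    children: List["Task"] = field(default_factory=list)


def combine_sequential_moves(task: Task) -> Task:
    """Walk the task tree and collapse chains of back-to-back moves in place."""
    # Collapse while this move has exactly one child, that child is also a
    # move, and the child reads from the path this move writes to.
    while (
        task.kind == "move"
        and len(task.children) == 1
        and task.children[0].kind == "move"
        and task.children[0].src == task.dst
    ):
        child = task.children[0]
        task.dst = child.dst            # move straight to the final location
        task.children = child.children  # adopt whatever followed the child
    for child in task.children:
        combine_sequential_moves(child)
    return task


# Example: scratch -> scratch -> table becomes a single scratch -> table move.
plan = Task("mapred", children=[
    Task("move", "scratch/tmp1", "scratch/tmp2", children=[
        Task("move", "scratch/tmp2", "warehouse/t"),
    ]),
])
combine_sequential_moves(plan)
```

If the real plan is a DAG in which a MoveTask can have several parents, this check would not be sufficient on its own; the sketch only covers the simple linear case described in the issue.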

> Remove extra MoveTask operators
> -------------------------------
>
>                 Key: HIVE-15114
>                 URL: https://issues.apache.org/jira/browse/HIVE-15114
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Hive
>    Affects Versions: 2.1.0
>            Reporter: Sahil Takiar
>
> When running simple insert queries (e.g. {{INSERT INTO TABLE ... VALUES ...}}), an
extraneous {{MoveTask}} is created.
> This is problematic when the scratch directory is on S3 since renames require copying
the entire dataset.
> For simple queries (like the one above), there are two MoveTasks. The first one moves
the output data from one file in the scratch directory to another file in the scratch directory.
The second MoveTask moves the data from the scratch directory to its final table location.
> The first MoveTask should not be necessary. The goal of this JIRA is to remove it. This
should help improve performance when running on S3.
> It seems that the first Move might be caused by a dependency resolution problem in the
optimizer, where a dependent task doesn't get properly removed when the task it depends on
is filtered by a condition resolver.
> A dummy {{MoveTask}} is added in the {{GenMapRedUtils.createMRWorkForMergingFiles}} method.
This method creates a conditional task which launches a job to merge files at the end of the
job. At the end of the conditional job there is a MoveTask.
> Even though Hive decides that the conditional merge job is not needed, it seems the MoveTask
is still added to the plan.
> Seems this extra {{MoveTask}} may have been added intentionally. Not sure why yet. The
{{ConditionalResolverMergeFiles}} says that one of three tasks will be returned: move task
only, merge task only, merge task followed by a move task.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
