hive-issues mailing list archives

From Sergio Peña (JIRA) <j...@apache.org>
Subject [jira] [Commented] (HIVE-15114) Remove extra MoveTask operators
Date Fri, 04 Nov 2016 19:09:58 GMT

    [ https://issues.apache.org/jira/browse/HIVE-15114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15637367#comment-15637367 ]

Sergio Peña commented on HIVE-15114:
------------------------------------

[~stakiar] the dummy MoveTask is a little tricky to remove. 

Every MoveTask is configured with its source and destination directories before the job runs.
Before the ConditionalTask is configured, the FileSinkOperator writes to {{-mr-10000}}.
While the ConditionalTask is being configured, the FileSinkOperator is changed to write to
{{-mr-10002}} instead, and the ConditionalTask then moves the files to {{-mr-10000}} so
that the next MoveTask can keep using the source configured in the original FileSinkOperator.

MERGE DISABLED
- FileSinkOperator writes final data to {{-mr-10000}}
- MoveTask moves data from {{-mr-10000}} to Table location {{/user/hive/warehouse/table}}

MERGE ENABLED
- FileSinkOperator writes final data to {{-mr-10002}}   << ConditionalTask modified the destination directory
- ConditionalTask moves data from {{-mr-10002}} to {{-mr-10000}}  << ConditionalTask will move merged or non-merged files to the original destination directory
- MoveTask moves data from {{-mr-10000}} to Table location {{/user/hive/warehouse/table}}
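
The two flows above can be written down as a small model to make the extra hop visible. This is only an illustrative sketch, not Hive code: the {{Step}} tuple and the functions {{current_plan}} and {{moves_after_write}} are hypothetical names, and the directory strings are the ones used in this comment.

```python
from typing import List, NamedTuple, Optional


class Step(NamedTuple):
    task: str
    source: Optional[str]  # None for the initial write, which has no source directory
    destination: str


ORIGINAL_DIR = "-mr-10000"
REDIRECTED_DIR = "-mr-10002"
TABLE_DIR = "/user/hive/warehouse/table"


def current_plan(merge_enabled: bool) -> List[Step]:
    """Model the steps described above (hypothetical helper, not a Hive API)."""
    if not merge_enabled:
        return [
            Step("FileSinkOperator", None, ORIGINAL_DIR),
            Step("MoveTask", ORIGINAL_DIR, TABLE_DIR),
        ]
    return [
        # The ConditionalTask configuration redirects the FileSinkOperator output.
        Step("FileSinkOperator", None, REDIRECTED_DIR),
        # Merged or non-merged files go back to the original destination.
        Step("ConditionalTask", REDIRECTED_DIR, ORIGINAL_DIR),
        Step("MoveTask", ORIGINAL_DIR, TABLE_DIR),
    ]


def moves_after_write(plan: List[Step]) -> int:
    """Count the steps that relocate data which has already been written."""
    return sum(1 for step in plan if step.source is not None)
```

In this model the merge-enabled plan relocates the data twice after the write, while the merge-disabled plan relocates it once.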

One option I was considering is to fold the last MoveTask into the ConditionalTask. Instead
of the dummy MoveTask (-mr-10002 -> -mr-10000), the move-only branch would move the data
straight to the table (-mr-10002 -> Table location). The merge task can stay as it is, but it
would then call the MoveTask to move the data to the table location (-mr-10000 -> Table location).

NEW MERGE
- FileSinkOperator writes final data to {{-mr-10002}}   << ConditionalTask modified the destination directory
- ConditionalTask does:
  a) Move Only: moves data from {{-mr-10002}} to Table location
  b) Merge Only: merges data from {{-mr-10002}} to {{-mr-10000}}, then moves data from {{-mr-10000}} to Table location
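
As a rough sketch of the NEW MERGE idea, the two branches could be modeled like this. It is not Hive code: {{Step}}, {{proposed_plan}} and the branch labels are hypothetical names chosen for illustration, and only the directory strings come from this comment.

```python
from typing import List, NamedTuple, Optional


class Step(NamedTuple):
    task: str
    source: Optional[str]  # None for the initial write, which has no source directory
    destination: str


ORIGINAL_DIR = "-mr-10000"
REDIRECTED_DIR = "-mr-10002"
TABLE_DIR = "/user/hive/warehouse/table"


def proposed_plan(needs_merge: bool) -> List[Step]:
    """Model the proposed ConditionalTask branches (hypothetical helper)."""
    plan = [Step("FileSinkOperator", None, REDIRECTED_DIR)]
    if needs_merge:
        # Branch b): merge into the original scratch directory, then move to the table.
        plan.append(Step("ConditionalTask/merge", REDIRECTED_DIR, ORIGINAL_DIR))
        plan.append(Step("ConditionalTask/move", ORIGINAL_DIR, TABLE_DIR))
    else:
        # Branch a): skip the dummy hop and move straight to the table.
        plan.append(Step("ConditionalTask/move", REDIRECTED_DIR, TABLE_DIR))
    return plan
```

In this model the move-only branch never touches {{-mr-10000}}, so any later task that expects to find data there would see nothing.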

I don't know whether that approach seems reasonable, or whether it is worth the effort. We also
need to be careful with the tasks executed after the ConditionalTask.
What if they are not MoveTasks?
What if a task needs to validate something in {{-mr-10000}} before copying it to the table
location, and that validation decides whether to continue with the copy to the table or to
delete {{-mr-10000}}?
 
Should we try to run more performance tests with the merge disabled to see if we save a lot
of time on that MoveTask?

> Remove extra MoveTask operators
> -------------------------------
>
>                 Key: HIVE-15114
>                 URL: https://issues.apache.org/jira/browse/HIVE-15114
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Hive
>    Affects Versions: 2.1.0
>            Reporter: Sahil Takiar
>            Assignee: Sergio Peña
>
> When running simple insert queries (e.g. {{INSERT INTO TABLE ... VALUES ...}}) an
extraneous {{MoveTask}} is created.
> This is problematic when the scratch directory is on S3 since renames require copying
the entire dataset.
> For simple queries (like the one above), there are two MoveTasks. The first one moves
the output data from one file in the scratch directory to another file in the scratch directory.
The second MoveTask moves the data from the scratch directory to its final table location.
> The first MoveTask should not be necessary. The goal of this JIRA is to remove it. This
should help improve performance when running on S3.
> It seems that the first Move might be caused by a dependency resolution problem in the
optimizer, where a dependent task doesn't get properly removed when the task it depends on
is filtered by a condition resolver.
> A dummy {{MoveTask}} is added in the {{GenMapRedUtils.createMRWorkForMergingFiles}} method.
This method creates a conditional task which launches a job to merge tasks at the end of the
file. At the end of the conditional job there is a MoveTask.
> Even though Hive decides that the conditional merge job is not needed, it seems the MoveTask
is still added to the plan.
> Seems this extra {{MoveTask}} may have been added intentionally. Not sure why yet. The
{{ConditionalResolverMergeFiles}} says that one of three tasks will be returned: move task
only, merge task only, merge task followed by a move task.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
