hive-issues mailing list archives

From "Sahil Takiar (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-14271) FileSinkOperator should not rename files to final paths when S3 is the default destination
Date Wed, 09 Nov 2016 00:38:58 GMT

    [ https://issues.apache.org/jira/browse/HIVE-14271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15649356#comment-15649356
] 

Sahil Takiar commented on HIVE-14271:
-------------------------------------

We might want to consider re-opening this ticket, but with a different approach. To clarify,
right now the FileSinkOperator (FSOP) always writes all of its data to a scratch directory.
The FSOP first writes to {{outPaths}} and then renames the data to {{finalPaths}}, but both
locations are still under the scratch directory. No data is exposed to users or future ETL
jobs at that point.

There are two different ways to modify this to improve performance on S3:

1: FSOP implements the "direct output committer" strategy (similar to HIVE-1620): all data
is written directly to the final table location, and no data is written to a staging file or
to the scratch directory. Hive's MoveTask (which runs in HiveServer2) does nothing.

2: FSOP still writes data to a scratch directory, but it writes to {{finalPaths}} instead of
{{outPaths}} (remember that both of these directories are under the scratch directory). No
temp file is written first. Hive's MoveTask (which runs inside HiveServer2) then copies the
data from the scratch directory to the final table location. This improves performance because
it avoids copying data from {{outPaths}} to {{finalPaths}}.

For reasons stated in earlier comments, there are a number of issues with approach 1. Implementing
approach 2 should be better, and should improve performance significantly.
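
To make the difference concrete, here is a minimal sketch of the current flow next to approach 2.
It is not Hive code: local temp directories stand in for the scratch directory and the table
location, and the class name, method names, and file names are all hypothetical. It only counts
how many move/rename calls each flow makes for one output file.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not Hive code: local temp directories stand in for
// the scratch directory and the final table location.
final class ScratchWriteSketch {
    private int moves = 0; // number of move/rename calls performed

    private void move(Path from, Path to) throws IOException {
        Files.move(from, to);
        moves++;
    }

    // Current flow as described above: write to outPath, rename to finalPath
    // (both under the scratch directory), then move to the table location.
    void writeThenRename(Path scratch, Path table, List<String> rows) throws IOException {
        Path outPath = scratch.resolve("_tmp.000000_0"); // assumed temp file name
        Path finalPath = scratch.resolve("000000_0");    // assumed final file name
        Files.write(outPath, rows, StandardCharsets.UTF_8);
        move(outPath, finalPath);                         // stand-in for the FSOP rename
        move(finalPath, table.resolve("000000_0"));       // stand-in for MoveTask
    }

    // Approach 2: write straight to finalPath under the scratch directory,
    // leaving only the MoveTask step.
    void writeToFinalPath(Path scratch, Path table, List<String> rows) throws IOException {
        Path finalPath = scratch.resolve("000000_0");
        Files.write(finalPath, rows, StandardCharsets.UTF_8);
        move(finalPath, table.resolve("000000_0"));       // stand-in for MoveTask
    }

    // Runs one flow against fresh temp directories and returns the move count.
    static int countMoves(boolean skipOutPath) {
        try {
            Path scratch = Files.createTempDirectory("scratch");
            Path table = Files.createTempDirectory("table");
            List<String> rows = Arrays.asList("row1", "row2");
            ScratchWriteSketch sketch = new ScratchWriteSketch();
            if (skipOutPath) {
                sketch.writeToFinalPath(scratch, table, rows);
            } else {
                sketch.writeThenRename(scratch, table, rows);
            }
            // Both flows must leave the same rows at the table location.
            if (!Files.readAllLines(table.resolve("000000_0")).equals(rows)) {
                throw new IllegalStateException("rows were not preserved");
            }
            return sketch.moves;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("write then rename: " + countMoves(false) + " moves");
        System.out.println("write to finalPath: " + countMoves(true) + " moves");
    }
}
```

In this sketch the current flow performs two moves per file and approach 2 performs one. On a
local filesystem the extra rename is cheap; the saving matters on S3 because, as noted in the
issue description, S3 does not support renaming.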

> FileSinkOperator should not rename files to final paths when S3 is the default destination
> ------------------------------------------------------------------------------------------
>
>                 Key: HIVE-14271
>                 URL: https://issues.apache.org/jira/browse/HIVE-14271
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Sergio Peña
>            Assignee: Sergio Peña
>
> FileSinkOperator does a rename of {{outPaths -> finalPaths}} when it finishes writing
> all rows to a temporary path. The problem is that S3 does not support renaming.
> Two options can be considered:
> a. Use a copy operation instead. After FileSinkOperator writes all rows to outPaths,
> the commit method will do a copy() call instead of move().
> b. Write row by row directly to the S3 path (see HIVE-1620). This may give better
> performance, but we should take care of the cleanup part in case of writing errors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
