hive-issues mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HIVE-15093) S3-to-S3 Renames: Files should be moved individually rather than at a directory level
Date Thu, 10 Nov 2016 10:32:58 GMT

    [ https://issues.apache.org/jira/browse/HIVE-15093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15653694#comment-15653694
] 

Steve Loughran edited comment on HIVE-15093 at 11/10/16 10:32 AM:
------------------------------------------------------------------

# I've just started HADOOP-13600, though preparation for, and attendance at, ApacheCon
Big Data means you should expect no real progress for the next 10 days.
# There's a recent discussion on common-dev about when the 2.8 RC comes out.

As far as HDP goes, all the S3A phase II read pipeline work is in HDP 2.5; the HDP Cloud on
AWS product adds the HADOOP-13560 write pipeline, and with a faster update cycle it'd be out the
door fairly rapidly too (disclaimer: no forward-looking statements, etc.). CDH hasn't shipped
with any of the phase II changes yet; that's something to discuss with your colleagues.
Given the emphasis on Impala & S3, I'd expect it sooner rather than later.

Here's [the work in progress|https://github.com/steveloughran/hadoop/blob/s3/HADOOOP-13600-rename/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java#L802];
as I note in the code, I'm not doing it right yet. We should have the list and delete operations
running in parallel as well, because listing is also pretty slow, and I want to eliminate all
sequential points in the code.
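The parallel-copy part of that rename can be sketched with a plain {{ExecutorService}}. This is an illustrative, self-contained sketch, not the actual S3AFileSystem code; {{copyObject}} here is a hypothetical stand-in for the real per-object COPY request (the real code would go through the AWS transfer machinery and also delete the source objects):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelRenameSketch {

    // Hypothetical stand-in for the per-object S3 COPY request; the real
    // rename would issue a copy and then a delete against the store.
    static String copyObject(String srcKey, String dstPrefix) {
        String name = srcKey.substring(srcKey.lastIndexOf('/') + 1);
        return dstPrefix + "/" + name;
    }

    // "Rename" a directory by submitting one copy task per child object,
    // so the COPY requests overlap instead of running one after another.
    public static List<String> parallelCopy(List<String> srcKeys,
                                            String dstPrefix,
                                            int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (String key : srcKeys) {
                pending.add(pool.submit(() -> copyObject(key, dstPrefix)));
            }
            List<String> copied = new ArrayList<>();
            for (Future<String> f : pending) {
                copied.add(f.get());  // surfaces any per-file failure
            }
            return copied;
        } finally {
            pool.shutdown();
        }
    }
}
```

Pushing the LIST and DELETE calls into the same pool is the remaining step: the listing can feed copy tasks as pages arrive, rather than waiting for the full enumeration.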

I know it's complicated, but it shows why this routine is so much better done in the layers
beneath: we can optimise every single HTTP request to S3, order the copy calls for maximum
overlap of operations, *and write functional tests against real S3 endpoints*. Object stores
are so different from filesystems that testing against localfs is misleading.



> S3-to-S3 Renames: Files should be moved individually rather than at a directory level
> -------------------------------------------------------------------------------------
>
>                 Key: HIVE-15093
>                 URL: https://issues.apache.org/jira/browse/HIVE-15093
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Hive
>    Affects Versions: 2.1.0
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
>         Attachments: HIVE-15093.1.patch, HIVE-15093.2.patch, HIVE-15093.3.patch, HIVE-15093.4.patch,
HIVE-15093.5.patch, HIVE-15093.6.patch, HIVE-15093.7.patch, HIVE-15093.8.patch, HIVE-15093.9.patch
>
>
> Hive's MoveTask uses the Hive.moveFile method to move data within a distributed filesystem
as well as blobstore filesystems.
> If the move is done within the same filesystem:
> 1: If the source path is a subdirectory of the destination path, files will be moved
one by one using a threadpool of workers
> 2: If the source path is not a subdirectory of the destination path, a single rename
operation is used to move the entire directory
> The second option may not work well on blobstores such as S3. Renames are not metadata
operations and require copying all the data. Client connectors to blobstores may not efficiently
rename directories. Worst case, the connector will copy each file one by one, sequentially
rather than using a threadpool of workers to copy the data (e.g. HADOOP-13600).
> Hive already has code to rename files using a threadpool of workers, but this only occurs
in case number 1.
> This JIRA aims to modify the code so that case 1 is triggered when copying within a blobstore.
The focus is on copies within a blobstore because needToCopy will return true if the src and
target filesystems are different, in which case a different code path is triggered.
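The case split in the description can be sketched as a small decision helper; the class, method, and parameter names here are illustrative, not Hive's actual API:

```java
public class MoveDecisionSketch {

    // Illustrative helper mirroring the description: within one filesystem,
    // take the per-file threadpool path (case 1) when the source sits under
    // the destination, or (per this JIRA) whenever the store is a blobstore;
    // otherwise a single directory rename (case 2) suffices.
    public static boolean moveFilesIndividually(boolean sameFilesystem,
                                                boolean srcUnderDest,
                                                boolean isBlobstore) {
        if (!sameFilesystem) {
            // Different filesystems: needToCopy returns true and a separate
            // copy code path is taken instead of any rename.
            return false;
        }
        return srcUnderDest || isBlobstore;
    }
}
```

On HDFS the directory rename of case 2 is a cheap metadata operation, so the helper only forces case 1 for blobstores, where a "rename" means copying every byte.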



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
