hadoop-common-issues mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HADOOP-13560) S3ABlockOutputStream to support huge (many GB) file writes
Date Mon, 26 Sep 2016 20:48:21 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-13560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Loughran updated HADOOP-13560:
    Status: Patch Available  (was: Open)

Commit fc16e03c; Patch 005. Moved all the operations in the block output stream which directly
interacted with the S3 client into a new inner class of S3AFileSystem, WriteOperationState.
This cleanly separates the output stream's work (buffering data and queuing uploads) from
the upload process itself. I think S3Guard may be able to do something with this, but I also
hope to use it as a start for async directory list/delete operations; this class would track
create-time probes and initiate the async deletion of parent directory objects after a
successful write. That's why there are separate callbacks for writeSuccessful and
writeFailed: we only want to spawn off the deletion when the write succeeded.
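A minimal, self-contained sketch of the callback split described above; the names and signatures are illustrative only, simplified from the description rather than taken from the patch:

```java
// Sketch of the write-lifecycle callbacks described above. All names and
// signatures here are hypothetical, not the actual HADOOP-13560 code.
public class WriteOperationStateSketch {

    /**
     * Tracks one write's lifecycle. The output stream only buffers data and
     * queues uploads; this class owns the interaction with the store.
     */
    static class WriteOperationState {
        private final String key;

        WriteOperationState(String key) {
            this.key = key;
        }

        /**
         * Invoked once the upload completed. This is the only place where a
         * real implementation would kick off async deletion of the fake
         * parent-directory objects.
         */
        void writeSuccessful() {
            System.out.println("write succeeded: " + key
                + "; safe to schedule parent directory cleanup");
        }

        /** Invoked on failure; no cleanup must be spawned in this case. */
        void writeFailed(Exception e) {
            System.out.println("write failed: " + key + ": " + e.getMessage());
        }
    }

    public static void main(String[] args) {
        WriteOperationState op =
            new WriteOperationState("s3a://bucket/path/file");
        try {
            // ... upload parts, complete the multipart upload ...
            op.writeSuccessful();
        } catch (RuntimeException e) {
            op.writeFailed(e);
        }
    }
}
```

The point of the split is that only the success path triggers deletion of parent directory markers; a failed write must leave the namespace untouched.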

In the process of coding all this I managed to break multipart uploads; that led to a clearer
understanding of how part uploads fail and an improvement in statistics collection. Other changes:
* got the imports back in sync with branch-2; the IDE had rearranged them.
* docs in more detail.
* manual testing through all the FS operations.
* locally switched all the s3a tests to use this (i.e. turned on block output in auth-keys.xml).

I think this is ready for review and play. I'd recommend the disk block buffer except in the
special case where you know you can upload data faster than you can generate it and want to
bypass the disk. But I'd be curious about performance numbers there, especially on distcp
operations with s3a as the destination.
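For reference, a sketch of the configuration that selects the buffering mode; property names follow the branch-2 patch series for this work, so verify them against the s3a docs that ship with your build:

```xml
<!-- Enable the block output stream (S3ABlockOutputStream). -->
<property>
  <name>fs.s3a.fast.upload</name>
  <value>true</value>
</property>

<!-- Buffer blocks on disk (recommended); "array" or "bytebuffer" keep
     blocks in memory, only worthwhile when uploads outpace generation. -->
<property>
  <name>fs.s3a.fast.upload.buffer</name>
  <value>disk</value>
</property>
```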

> S3ABlockOutputStream to support huge (many GB) file writes
> ----------------------------------------------------------
>                 Key: HADOOP-13560
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13560
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 2.9.0
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Minor
>         Attachments: HADOOP-13560-branch-2-001.patch, HADOOP-13560-branch-2-002.patch,
HADOOP-13560-branch-2-003.patch, HADOOP-13560-branch-2-004.patch
> An AWS SDK [issue|https://github.com/aws/aws-sdk-java/issues/367] highlights that metadata
isn't copied on large copies.
> 1. Add a test to do that large copy/rename and verify that the copy really works.
> 2. Verify that metadata makes it over.
> Verifying large file rename is important on its own, as it is needed for very large commit
operations for committers using rename

This message was sent by Atlassian JIRA

