hadoop-common-issues mailing list archives

From "Jordan Mendelson (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HADOOP-9454) Support multipart uploads for s3native
Date Thu, 04 Apr 2013 05:19:19 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-9454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jordan Mendelson updated HADOOP-9454:
-------------------------------------

    Status: Patch Available  (was: Open)

Here is a patch against trunk which adds multipart upload support. It also updates the jets3t
library to 0.90 (based on a patch in HADOOP-8136).

It is difficult to build automated tests for this because a valid S3 access key is required
in order to test writing to S3 buckets. However, I verified that the patch does allow uploading
files larger than 5 GB (tested by uploading an 8 GB image of my root filesystem, renaming it on
S3, which requires a multipart upload copy, then downloading it and comparing the md5sum). It
continues to work as normal when fs.s3n.multipart.uploads.enabled is set to false, and I have
run through various fs commands to verify that everything works as it should.

This patch adds two config options: fs.s3n.multipart.uploads.enabled and fs.s3n.multipart.uploads.block.size.
The former is named after the Amazon setting that does the same thing and defaults to false.
The latter controls both the minimum file size at which multipart uploads kick in and the size
of each uploaded part (default: 64 MB).
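
For example, turning the feature on from code might look like the sketch below. The property
names are the ones added by this patch; the bucket, file paths and class name are purely
illustrative, and credentials are assumed to come from the usual fs.s3n.awsAccessKeyId /
fs.s3n.awsSecretAccessKey settings.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class S3nMultipartExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Enable multipart uploads for s3native (the patch defaults this to false).
        conf.setBoolean("fs.s3n.multipart.uploads.enabled", true);
        // Minimum file size and part size for multipart uploads (64 MB is the default).
        conf.setLong("fs.s3n.multipart.uploads.block.size", 64L * 1024 * 1024);
        FileSystem fs = FileSystem.get(URI.create("s3n://some-bucket/"), conf);
        // Anything larger than the block size is now uploaded in parts.
        fs.copyFromLocalFile(new Path("/tmp/bigfile.img"), new Path("/bigfile.img"));
      }
    }

The same two properties can of course be set in core-site.xml instead.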

By default, jets3t will only spawn two threads to upload, but you can change this by setting
the threaded-service.max-thread-count property in the jets3t.properties file. I've tried with
upwards of 20 threads and it is significantly faster.
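
For instance, a jets3t.properties on the classpath along these lines (the thread count here is
just an example) increases the upload parallelism:

    # jets3t.properties
    threaded-service.max-thread-count=20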

This patch should also work with older versions of Hadoop with only minor changes, since the
s3native and s3 filesystems haven't changed much. I originally wrote it for CDH 4.

Please note that, because of the way hadoop fs works, uploads require a remote copy, which
takes a while for large files. hadoop fs writes each file as filename._COPYING_ and then
renames it. Unfortunately, Amazon S3 has no rename support, so we must do a copy() followed
by a delete(), and the copy() can take quite a while for large files. Also because of this,
when multipart uploads are enabled, an additional request is made to AWS during a copy to
check whether the source file is larger than 5 GB.
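
In other words, the rename step behind hadoop fs boils down to roughly the sketch below. This
is not the actual NativeS3FileSystem code; copyOnS3 is a hypothetical stand-in for the jets3t
copy call, and the 5 GB threshold is the S3 single-copy limit mentioned above.

    import java.io.IOException;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RenameSketch {
      private static final long MULTIPART_COPY_THRESHOLD = 5L * 1024 * 1024 * 1024; // 5 GB

      // Sketch only: S3 has no rename, so a "rename" is a server-side copy plus a delete.
      static boolean rename(FileSystem fs, Path src, Path dst) throws IOException {
        // With multipart enabled, this extra request to AWS is what checks whether the
        // source object is over 5 GB and therefore needs a multipart upload copy.
        boolean needsMultipartCopy = fs.getFileStatus(src).getLen() > MULTIPART_COPY_THRESHOLD;
        copyOnS3(src, dst, needsMultipartCopy); // hypothetical wrapper around the jets3t copy
        return fs.delete(src, false);           // remove the original once the copy succeeds
      }

      private static void copyOnS3(Path src, Path dst, boolean multipart) {
        // Placeholder: in the real filesystem this would issue the (multipart) copy request.
      }
    }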
                
> Support multipart uploads for s3native
> --------------------------------------
>
>                 Key: HADOOP-9454
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9454
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs/s3
>            Reporter: Jordan Mendelson
>
> The s3native filesystem is limited to 5 GB file uploads to S3; however, the newest version
> of jets3t supports multipart uploads, which allows storing multi-TB files. While the s3
> filesystem lets you bypass this restriction by uploading blocks, we need to output our data
> into Amazon's publicdatasets bucket, which is shared with others.
> Amazon has added a similar feature to their distribution of Hadoop, as has MapR.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
