hadoop-common-issues mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HADOOP-14028) S3A block output streams don't delete temporary files in multipart uploads
Date Wed, 15 Feb 2017 15:33:41 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-14028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Loughran updated HADOOP-14028:
    Attachment: HADOOP-14028-branch-2.8-005.patch

patch 005.

File references are now passed directly to S3 multipart puts, leaving the S3 client to deal
with the reset-on-failure logic. This isn't done for single-part puts, as apparently that can
only be done at the cost of no longer being able to create custom metadata, which we need in
order to set encryption keys etc.
Essentially, {{DataBlock.startUpload()}} moves from returning a stream to returning a {{BlockUploadData}}
structure containing a stream *or* a file. In a multipart put, the file is explicitly picked
up. For a single put, {{BlockUploadData.asInputStream()}} is called, either to return that
input stream or to open one for the file.
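
As a rough illustration of the stream-or-file structure described above, here is a minimal Java sketch. It is not the code in the patch: the class name {{BlockUploadDataSketch}}, its constructors, and the {{hasFile()}} and {{getFile()}} helpers are assumptions made for the example; only the {{asInputStream()}} name comes from the description.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;

/** Hypothetical holder for upload data: exactly one of file or stream is set. */
final class BlockUploadDataSketch {
  private final File file;           // set for disk-backed blocks
  private final InputStream stream;  // set for in-memory blocks

  BlockUploadDataSketch(File file) {
    if (file == null) {
      throw new IllegalArgumentException("null file");
    }
    this.file = file;
    this.stream = null;
  }

  BlockUploadDataSketch(InputStream stream) {
    if (stream == null) {
      throw new IllegalArgumentException("null stream");
    }
    this.file = null;
    this.stream = stream;
  }

  /** True when the data lives in a file that can be handed over directly. */
  boolean hasFile() {
    return file != null;
  }

  /** The backing file, or null when the data is stream-backed. */
  File getFile() {
    return file;
  }

  /** Return the wrapped stream, or open a new stream over the file. */
  InputStream asInputStream() throws IOException {
    return hasFile() ? Files.newInputStream(file.toPath()) : stream;
  }
}
```

Under these assumptions, a multipart put would check {{hasFile()}} and pass the file reference on, while a single put would always call {{asInputStream()}}.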

# Been through the code, method names and javadocs in the {{S3ADataBlocks}} file to make it
consistent with current behaviour.
# Addressed Aaron's comments about duplicate close() calls; both single and multipart puts
will close things in the finally clause.
# Tested: S3A against S3 Ireland, 128MB scale; all well apart from that intermittent root delete consistency problem.
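
The close-in-finally pattern mentioned in the list above can be sketched as follows. This is an illustration only, not the patch's code: {{CloseOnceBlock}}, {{UploadSketch}} and the cleanup counter are hypothetical names, and guarding close() with a flag is just one assumed way of making duplicate close() calls harmless.

```java
import java.io.Closeable;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical block resource whose close() is safe to call more than once. */
final class CloseOnceBlock implements Closeable {
  private final AtomicBoolean closed = new AtomicBoolean(false);
  private int cleanups = 0;

  @Override
  public void close() {
    // Only the first call performs cleanup; later calls are no-ops.
    if (closed.compareAndSet(false, true)) {
      cleanups++;
    }
  }

  /** Number of times the cleanup actually ran. */
  int cleanupCount() {
    return cleanups;
  }
}

final class UploadSketch {
  /** Runs the (hypothetical) put action and always closes the block afterwards. */
  static void upload(CloseOnceBlock block, Runnable put) {
    try {
      put.run();
    } finally {
      // Reached on success and on failure alike.
      block.close();
    }
  }
}
```

With this shape, the block is released whether the put succeeds or throws, and a second close() from another code path does no extra work.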

(My laptop died last week and I'm only slowly recovering with a dev setup... testing has been
somewhat complicated here, and more review & testing would be very much appreciated.)

> S3A block output streams don't delete temporary files in multipart uploads
> --------------------------------------------------------------------------
>                 Key: HADOOP-14028
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14028
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/s3
>    Affects Versions: 2.8.0
>         Environment: JDK 8 + ORC 1.3.0 + hadoop-aws 3.0.0-alpha2
>            Reporter: Seth Fitzsimmons
>            Assignee: Steve Loughran
>            Priority: Critical
>         Attachments: HADOOP-14028-branch-2-001.patch, HADOOP-14028-branch-2.8-002.patch,
>                      HADOOP-14028-branch-2.8-003.patch, HADOOP-14028-branch-2.8-004.patch, HADOOP-14028-branch-2.8-005.patch
> I have `fs.s3a.fast.upload` enabled with 3.0.0-alpha2 (it's exactly what I was looking
> for after running into the same OOM problems) and don't see it cleaning up the disk-cached
> I'm generating a ~50GB file on an instance with ~6GB free when the process starts. My
> expectation is that local copies of the blocks would be deleted after those parts finish uploading,
> but I'm seeing more than 15 blocks in /tmp (and none of them have been deleted thus far).
> I see that DiskBlock deletes temporary files when closed, but is it closed after individual
> blocks have finished uploading or when the entire file has been fully written to the FS (full
> upload completed, including all parts)?
> As a temporary workaround to avoid running out of space, I'm listing files, sorting by
> atime, and deleting anything older than the first 20: `ls -ut | tail -n +21 | xargs rm`
> Steve Loughran says:
> > They should be deleted as soon as the upload completes; the close() call that the
> > AWS httpclient makes on the input stream triggers the deletion. Though there aren't tests
> > for it, as I recall.
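
The delete-on-close behaviour described in the quoted comment can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the S3A implementation: {{DeleteOnCloseInputStream}} is a hypothetical class, and the real code may wire the deletion up differently.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

/** Hypothetical stream that removes its backing temporary file when closed. */
final class DeleteOnCloseInputStream extends FileInputStream {
  private final File file;

  DeleteOnCloseInputStream(File file) throws IOException {
    super(file);
    this.file = file;
  }

  @Override
  public void close() throws IOException {
    try {
      super.close();
    } finally {
      // Best-effort removal; a repeated close() finds nothing left to delete.
      file.delete();
    }
  }
}
```

The sketch also shows why the cleanup depends on the consumer: if whoever reads the stream never calls close(), the temporary file stays on disk.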

This message was sent by Atlassian JIRA
