hadoop-common-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-11183) Memory-based S3AOutputstream
Date Sat, 21 Feb 2015 13:00:15 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-11183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14330213#comment-14330213

Steve Loughran commented on HADOOP-11183:

I have done a code review, but not actually used this yet. I'm worried it's being looked at
a bit late for the 2.7 branch, but  I can't fault anyone but myself and other reviewers for

If we get it in, I'd like the markdown docs to highlight "experimental, unstable, use at own
risk" for this option. Then 2.7 can be viewed as stabilisation of it, with goal being trusted
by 2.8 & that warning removed. Perhaps a special subsection of s3a, "experiment fast output
stream" and some specifics, rather than just listing a config flag, "fast", which everyone
is going to turn on based on its name alone.

h3. constructor
# check for and reject negative buffer size. Maybe have a check & Warn/reject too small
a buffer?
# S3AFastOutputStream constructor should log @debug, not info

h3. {{write()}}
# would benefit from some comments explaining the two branches, e.g. "copy into buffer" and
"buffer full, write it" + "recursively write remainder"
# If I created a 1-byte buffer, would that tail recursion trigger a stack overflow? If so,
some checks in ctor may be good, such as just hard coded minimum & upping low values to

h3. {{close()}}

I'd swap the operations in the finally clause, so no matter what the superclass does, this
close operation always finishes. Better yet, set close() earlier.

h3. {{uploadBuffer()}}
make private

h3. {{putObject()}}
InterruptedException shouldn't be swallowed here, it hides problems and can cause process
shutdowns to hang. Either convert to InterruptedIOException, or set interrupted flag on the
current thread again.

h3. {{waitForAllPartUploads()}}
Need a policy on interrupts here too. Imagine that communications with a remote S3 server
are blocked, something has just sent a kill -3 to the client app, and is now waiting for it
to finish cleanly...

h3. {{ProgressableListener}}
can me made private

h3. Other. 

# Can we have {{flush()}} do a partial write? Would that gain anything in durability terms?
At the very least, it may make timing of writes more consistent with other filesystems.
# I can think of some more tests here. If {{S3AFastOutputStream}} added a counter of #of uploads,
a test could grab that stream and verify that writes triggered it, flushes triggered it, 0-byte-writes
didn't, close cleaned up, etc. Also be interesting to time & compare classic vs fast operations
& print that in the test results. 

> Memory-based S3AOutputstream
> ----------------------------
>                 Key: HADOOP-11183
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11183
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 2.6.0
>            Reporter: Thomas Demoor
>            Assignee: Thomas Demoor
>         Attachments: HADOOP-11183-004.patch, HADOOP-11183-005.patch, HADOOP-11183.001.patch,
HADOOP-11183.002.patch, HADOOP-11183.003.patch, design-comments.pdf
> Currently s3a buffers files on disk(s) before uploading. This JIRA investigates adding
a memory-based upload implementation.
> The motivation is evidently performance: this would be beneficial for users with high
network bandwidth to S3 (EC2?) or users that run Hadoop directly on an S3-compatible object
store (FYI: my contributions are made in name of Amplidata). 

This message was sent by Atlassian JIRA

View raw message