hadoop-common-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-13786) Add S3Guard committer for zero-rename commits to S3 endpoints
Date Fri, 20 Oct 2017 16:56:00 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-13786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212876#comment-16212876

Steve Loughran commented on HADOOP-13786:

Aaron. thanks for all your review, I'll try to go through them here.

I'm about to create a sub JIRA & apply my work as a PR to it, so that we can try doing
reviews on github without the noise overloading this JIRA. For now though, here's my review
of your comments

h3. first set of comments

findbugs.xml. corrected. Issue is about a stream not being closed; the library called does

bq. why copy/paste JsonSerDeser versus import?

the previous one is in a yarn JAR not on the CP for hadoop common. The existing class is now
a subclass of the new one, with some
extra stuff to handle parsing data off zookeeper records.

bq. fs.s3a.retry.throttle.interval defaults

good point. Let me seek some advice. I'm less worried about thundering herds than the total
#of requests/second, though on different S3 implementations lock-step clients may be an issue.
I'll make it 500 for now. It's with an RetryUpToMaximumCountWithProportionalSleep policy,
so we could go to say, 250 & have it go sleep(250), sleep(500)...sleep(750)... until max-retries.
There's no jitter in that class though.

bq. PathOutputCommitterFactory & else {} clause with nothing but comments.


bq. Update comment: "Verify that if the output path is null."


bq. fs.s3a.retry.limit vs fs.s3a.attempts.maximum

Actually, there's no good reason for the difference is there, other than the first was used
in the AWS SDK only? I'll use attempts.maximum; picked up the same default value (20). Updated:
Code, docs, core-default.xml

bq. What if SDK bug yields NPE, for example? Then you NPE in your code in translateException()
guts, no?

bq. Retries stuff great to see, but the change is invasive and needs separate careful consideration
IMO. How painful would this be to pull out into separate patch?

Pretty major. I started with it for all the new commit operations, so a commit wouldn't fail
from a transient event, & then addeed fault injection of throttle events. At which point
all the tests got into a mess as setup/teardown and every existing call (getFileStatus())
failed. Essentially: you can't have a robust committer without making all the ops fault tolerant.
Also goes with the move of WriteOperationsHelper. That said, the DDB retry logic is something
separate; its not something I've yet got fault injection here.

bq. First question here is, what about SDK retries? I know the SDK documentation and behavior
is spotty at best, but there are some known things (i.e. DynamoDB does retry in SDK for most
operations, exclusive of batched operations which are application responsibility).

I don't see that. In fact, I've seen DDB fail on throttling. Use the CLI to crank back the
IOP capacity to, say, 5 read + 5 write and then run the integration tests in parallel with
-Ds3guard -Ddynamodb. You will get overloaded. That's why I added the s3guard CLI to list/change
IOPs...bored of going near the web UI and expecting it to be a recurrent support call. 

I do know that AWS s3 transfer manager does retries using mark/reset to handle retry of part
uploads, and older SDK's didn't handle failure of the final post of an MPU (HADOOP-14028).
The code as stand, assumes that transfer manager handles failures of async PUT/POST itself,
but retries all other failures, with the @attrs to make clear to methods calling other methods
when they should *not* be doing any more retry or translate themselves. If you wrap retry
with retry you end up in a situation where permanent failures take too long to surface.

bq. (re idempotent delete ) "interesting. Is the argument that delete is racy anyways, even
if clients externally synchronize, thus we treat it as idempotent even though it is not?"

There's a past WONTFIX JIRA when someone (Konstantin?) wanted to make delete idempotent on
HDFS and it was agreed, no, never. Here I have made it idempotent on the basis that object
stores are odd anyway. In particular, code which uses create(overwrite=false) can get into
conflict with delete, but as create-no-overwrite isn't atomic in S3, do we care. Then there's
recursive delete.

Here, I don't know what to do. It is a lot easier if there is retry round delete, but it may
be something done at a higher level than the FS APIs.

bq. re: LOG.info() of failed to purge. 

That log comes from HADOOP-13208 & failing if you ever tried to start up against landsat-pds
with purge enabled. I'll bump it up to INFO for now & see what people say. Turned off
for landsat in the test/resources/core-site.xml file too.

bq.  @Retries.RetryRaw

Thanks for the appreciation. I started off with javadocs but it wasn't that reliable...annotations
are clearer. We have them in {{org.apache.hadoop.io.retry}} too, but they are actually declaring
method semantics for IPC to choose how to react. This is just docs. I think they may ->
the javadocs, but not checked. bigget risk: code changes make the annotations obsolete &
it is only picked up by manual review (and ideally fault injection)

bq. org.apache.hadoop.fs.s3a.commit.Tasks

Ryan's code; I've not changed other than to move to lambda-expressions in invocations. Some
of the tests try the parallel vs sequential execution within committers though, in particular
TestStagingCommitter, which does explore the various failure paths.

But yes, I could do a standalone test to for more low-level validation...let me add that to
my todo list.

bq. How about passing a Logger into DurationInfo instead of it using its own?

Good idea. Done. 

BTW, been thinking about Htrace support here. I won't do it in this patch, but its the obvious
place to declare spans. For that though: start in S3AFS, have consistent span names across
operations in hadoop-common, then move up to commit, where you also need your execution engine
to embrace it too.

bq. cleanup/cleanupStagingDirs

Made abstract for you. I do need to do some more walkthroughs to be happy that we are doing
as much cleanup as possible.
Interesting issue there: should we do a best-effort cleanup on failure of, say task commit,
or rely on abortTask/abortJob to force that?

bq. Failure callback (in abortPendingUploads) is same as intended operation. Is this in effect
a single retry?

good point. Something probably taken from the Netflix code. I'll cut it, as there is retry
logic in the abort now anyway.

bq. javadocs on AbstractS3GuardCommitterFactory

fixed to be consistent with code.

bq. innerCommit & AtomicInteger

gets passed in to a closure which is then handed off to the retry stuff to call on every retry.
It's not that an atomic is needed, just something to hold an integer which can be incremented
in closures. You can see that in other places/code elsewhere too, like in local vars you want
closures to updated. If there's a non-atomic version, especially something in the JDK (and
*not* guava) then I'd be happy to switch.

bq. CommitOperations.ls & "robust in JDK"

It used to be {{invoke.retry(()-> fs.listFiles(path.recursive))}}, and while the listFiles
call retried, the hasNext/next calls weren't. With the move of retries into the FS, both ops
are robust & so the call has been simplified. I'll change the javadocs.

bq. Paths & Unit tests

There's some stuff in {{TestStagingPartitionedFileListing}}, otherwise nothing specific except
in the mock & integration tests for the staging committers.

bq. retries around dynamoDB.batchWriteItemUnprocessed()

OK, removed retry logic. There still aspects of its failures I think I need to see more of
to understand


h3. docs

Thanks for looking at the docs. I am still going through them as I think I've got some duplication.
I concur with your statement that this is better than the MR docs; spent a lot of time stepping
through tests to work out WTF FileOutputCommitter really does, and how MapReduce uses it.
It's made me a strong enthusiast for deleting the v1 APIs entirely. **It is time**.

bq.  I do like "successed" though, colloquially

Once you become a CS professor you can start using enough in talks that people will think
it's a real term, especially if you add some extra semantics around observed state. A process
may *succeed* locally, but it is only *successed* when the rest of the system observes this
and moves their state model into a succeeded state. Yes, we could do this...

For now, fixed the docs.

bq. /S3Guard committers/S3A committers/ 

dunno. The magic one utterly depends on S3Guard for consistency, so using the term keeps it
clear, also helps with the branding of that. The staging ones don't have that 

bq. Nice Trick (re setting `mapreduce.fileoutputcommitter.algorithm.version` = 10 to fail

yeah, guess how I came up with that. 

bq. Multiple clients racing to write same path cannot detect which one actually succeeded
via return code (IIRC this is determined by S3 service-side timestamp and not visible to clients).
Am I right or mistaken here?

It's not that >1 client "succeeds" so much as they both succeed, you just don't know which
one came last, and even checking afterwards for timestamp or etag doesn't guarantee there's
not an overwrite in progress.

bq clarity nit: /implementations of the S3 protocol/S3 protocol-compatible storage systems/

changed to "Some S3-compatible object stores are fully consistent;"

bq. > It requires a path element, such as `__magic` which cannot be used for any purpose
other than for the storage of pending commit data.
bq. Aside: We could make this string configurable in the future, no?

no, it will only cause chaos, confusion and support calls. I chose a path that nobody appears
to use today; if ever you get a bug report with __magic in the path you can see what the issue
is related to. Whereas if something could say _temporary was "magic", there'd be chaos and
you'd never work out why. Your support team will never forgive you for even suggesting this,
and QE will not be sending you christmas cards.

bq. > skipDuringFaultInjection(fs);
bq. Nice.

you can do the entire suite with throttling cranked up; how I got both the fault injection
& retry coverage right.

bq. > log4j.logger.org.apache.hadoop.fs.s3a=DEBUG
bq. Seems a little heavy.

noted; cut back

Final bit of the architecture doc: yeah, it's out of date. I've already updated the spark
bit, but it needs a review and edit. Like you noted, its got design work in there oo.

h3. Other issues.

Writing up this stuff here and elsewhere makes me think the term "directory committer" is
a bit confusing. What about we call that, at least in docs & config the "staging" committer,
with the partitioned committer its sibling. No end users should be seeing the superclass after
all. Similarly, renaming the DynamicFactory the S3GuardCommitterFactory (or S3ACommitterFactory),
and only write up the use of that. It's the easiest to set up and explain. In particular we
could make it the default for the S3A schema, just using the default fs.s3a.committer=file
for a bucket for today's (flawed) behaviour. Change the committer name globally or per bucket,
and you've got fewer config options to get wrong/try and debug.

Next patch will have more logging at init time, primarly related to debugging things in some
new bulk spark integration. Filed HADOOP-14965 for one key issue, but expect some more logging
@debug in the committers too.

> Add S3Guard committer for zero-rename commits to S3 endpoints
> -------------------------------------------------------------
>                 Key: HADOOP-13786
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13786
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: fs/s3
>    Affects Versions: 3.0.0-beta1
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>         Attachments: HADOOP-13786-036.patch, HADOOP-13786-037.patch, HADOOP-13786-038.patch,
HADOOP-13786-039.patch, HADOOP-13786-HADOOP-13345-001.patch, HADOOP-13786-HADOOP-13345-002.patch,
HADOOP-13786-HADOOP-13345-003.patch, HADOOP-13786-HADOOP-13345-004.patch, HADOOP-13786-HADOOP-13345-005.patch,
HADOOP-13786-HADOOP-13345-006.patch, HADOOP-13786-HADOOP-13345-006.patch, HADOOP-13786-HADOOP-13345-007.patch,
HADOOP-13786-HADOOP-13345-009.patch, HADOOP-13786-HADOOP-13345-010.patch, HADOOP-13786-HADOOP-13345-011.patch,
HADOOP-13786-HADOOP-13345-012.patch, HADOOP-13786-HADOOP-13345-013.patch, HADOOP-13786-HADOOP-13345-015.patch,
HADOOP-13786-HADOOP-13345-016.patch, HADOOP-13786-HADOOP-13345-017.patch, HADOOP-13786-HADOOP-13345-018.patch,
HADOOP-13786-HADOOP-13345-019.patch, HADOOP-13786-HADOOP-13345-020.patch, HADOOP-13786-HADOOP-13345-021.patch,
HADOOP-13786-HADOOP-13345-022.patch, HADOOP-13786-HADOOP-13345-023.patch, HADOOP-13786-HADOOP-13345-024.patch,
HADOOP-13786-HADOOP-13345-025.patch, HADOOP-13786-HADOOP-13345-026.patch, HADOOP-13786-HADOOP-13345-027.patch,
HADOOP-13786-HADOOP-13345-028.patch, HADOOP-13786-HADOOP-13345-028.patch, HADOOP-13786-HADOOP-13345-029.patch,
HADOOP-13786-HADOOP-13345-030.patch, HADOOP-13786-HADOOP-13345-031.patch, HADOOP-13786-HADOOP-13345-032.patch,
HADOOP-13786-HADOOP-13345-033.patch, HADOOP-13786-HADOOP-13345-035.patch, cloud-intergration-test-failure.log,
objectstore.pdf, s3committer-master.zip
> A goal of this code is "support O(1) commits to S3 repositories in the presence of failures".
Implement it, including whatever is needed to demonstrate the correctness of the algorithm.
(that is, assuming that s3guard provides a consistent view of the presence/absence of blobs,
show that we can commit directly).
> I consider ourselves free to expose the blobstore-ness of the s3 output streams (ie.
not visible until the close()), if we need to use that to allow us to abort commit operations.

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org

View raw message