hadoop-common-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Aaron Fabbri <fab...@cloudera.com>
Subject Re: Getting close to a vote on a merge of S3Guard., HADOOP-13345
Date Wed, 16 Aug 2017 21:17:35 GMT
On Wed, Aug 16, 2017 at 1:39 PM, Andrew Wang <andrew.wang@cloudera.com>

> Hi Steve,
> What's the target release vehicle, and the timeline for merging this? The
> target date for beta1 is mid-September, so any large code movements make me
> nervous.

I think this is ready to get in before beta1.  Most of upstream s3a dev has
been happening on this branch so it has a lot of improvements and testing.

> Could you comment on testing and API stability of this branch? I'm trusting
> the judgement of the contributors involved, since there isn't much time to
> fix things before beta1.
We've done a ton of testing on this branch:

- List consistency tests with failure injection. (HADOOP-13793) This
integration test forces a delay in visibility of certain files by wrapping
the AWS S3 client. It asserts listing is consistent. The test fails without
S3Guard, and succeeds with it.

- All existing S3 integration tests with and without S3Guard. The
filesystem contract tests have been invaluable here. (HADOOP-13589 makes
these very easy to run).

- MetadataStore contract tests that ensure that the API semantics of the
DynamoDB and in-memory reference implementations are correct.

- MetadataStore scale tests that can be used to force DynamoDB service
throttling and ensure we are robust to that.

- Unit tests for different parts of the S3Guard logic.

As you probably know, at Cloudera we are using this codebase in production,
and have run all of our downstream tests including Hive, Spark, Impala on
the new S3A client code, with and without S3Guard enabled.

In terms of API compatibility, the new features sit behind the FileSystem /
FileContext APIs, which have not changed.  Applications don't require any
changes.  Internal APIs for S3Guard, such as MetadataStore (currently
private / evolving), should be properly annotated already.  The S3Guard
work has been active for quite a while now, so the APIs are fairly stable
in practice.

Probably my biggest goal in writing the S3AFileSystem integration code
(HADOOP-13651) was to preserve existing logic and correctness when S3Guard
is not enabled.  One design choice which has worked well was to define a
"null" implementation of the MetadataStore (the API that filesystem clients
use to log metadata changes):


This is used in S3A by default. This made it easier to reason about
correctness and minimized the size of the diff to the FS client as well.

Other questions welcomed!


> Andrew
> On Wed, Aug 16, 2017 at 5:25 AM, Steve Loughran <stevel@hortonworks.com>
> wrote:
> >
> > FYI, We're getting ready for a patch to merge the current S3Guard branch,
> > HADOOP-13345, via a patch https://issues.apache.org/
> > jira/browse/HADOOP-13998
> >
> > After that's done, we do plan to have a second iteration, work on a
> > 0-rename committer (HADOOP-13786) with all the other tuning and
> > improvements; We'd add a new uber-JIRA & move stuff over, maybe branch,
> > and/or do things patch-by-patch .
> >
> > Anyway, now is a great time for people to download and play
> >
> > https://github.com/apache/hadoop/blob/HADOOP-13345/
> > hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/s3guard.md
> >
> > testing this
> >
> > https://github.com/apache/hadoop/blob/HADOOP-13345/
> > hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/testing.md
> >
> > The Inconsistent AWS Client is also something everyone is free to use for
> > injecting inconsistencies (and soon faults) into their own apps by way of
> > 2-3 config options. Want to know how your code handles S3A being
> observably
> > inconsistent? We'll let you do that.
> >
> > -Steve
> >
> >
> >

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message