hadoop-common-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Aaron Fabbri (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-13447) S3Guard: Refactor S3AFileSystem to support introduction of separate metadata repository and tests.
Date Thu, 11 Aug 2016 23:16:20 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-13447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15418098#comment-15418098

Aaron Fabbri commented on HADOOP-13447:

Great discussion, I'm enjoying it (although an hour with a whiteboard would be even better).

My eventual (no pun intended!) plan was going to be to evolve the interfaces and separation
of responsibilities for AbstractS3AccessPolicy and S3Store such that S3Store never makes its
own internal metadata calls (like the internal getFileStatus calls you mentioned). 

Yeah, I was trying to come up with a way to do this.  My goals are clean / minimal code, and
near-optimal performance (want upstream s3a to be very fast).

The crux of the problem seems that the top level logic of things like mkdirs()/rename() etc.
need to call getFileStatus(), and those internal getFileStatus() calls are subject to the
same source-of-truth and retry policies as the public API.  In this case, the only way I can
think to really separate the top-level logic for FileSystem ops (e.g. mkdir) from the policy
on MetadataStore, is to build some sort of execution plan in top level logic then pass to
a policy layer to execute it.  You'd have three layers, roughly, FileSystem top-level, Raw
S3 I/O, Policy / execution. This seems like it would be slower and way over-complicated (some
steps in execution are conditional, you end up with a pseudo-language).  There is also the
AOP approach that s3mper took, but I think we can do better since we can modify the upstream
code, and our goals are a bit more ambitious (more than just listStatus(), plus ability to
use MetadataStore as source of truth).

 I thought it would introduce complicated control flow in a lot of the S3AFileSystem methods.
However, refactorings like this are always subjective, and it's entirely possible that I was

This is true.  Seems like the classic performance vs. complexity tradeoff.  I don't think
you are wrong at all, just a question of priorities.  I'm willing to sacrifice a little code
separation for significant performance benefit, and possibly a more complete solution (e.g.
if internal getFileStatus() call misses a file above mkdirs(path) because that file creation
was eventually consistent).

If you prefer, we can go that way

That is my preference at this stage.  It was not as bad as I thought it would be when we prototyped
it.  Also I like doing refactoring / generalization after seeing some concrete cases in code
(i.e. start by complicating top level S3AFileSystem logic).

My preference would probably change if I could think of a clean way to handle the internal
getFileStatus() calls.

{quote} and later revisit the larger split I proposed here in patch 001 if it's deemed necessary.{quote}

I think we do need to break up S3AFileSystem.  I'll be very supportive of future refactorization
here, in general.

Really the only portion of patch 001 that I consider an absolute must is the S3ClientFactory
work. I think it's vital to the project that we have the ability to mock S3 interactions to
simulate eventual consistency.

Yes, I like this part of it.  Being able to mock the S3 service out would be awesome, and
I didn't notice any real downside.

> S3Guard: Refactor S3AFileSystem to support introduction of separate metadata repository
and tests.
> --------------------------------------------------------------------------------------------------
>                 Key: HADOOP-13447
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13447
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>            Reporter: Chris Nauroth
>            Assignee: Chris Nauroth
>         Attachments: HADOOP-13447-HADOOP-13446.001.patch, HADOOP-13447-HADOOP-13446.002.patch
> The scope of this issue is to refactor the existing {{S3AFileSystem}} into multiple coordinating
classes.  The goal of this refactoring is to separate the {{FileSystem}} API binding from
the AWS SDK integration, make code maintenance easier while we're making changes for S3Guard,
and make it easier to mock some implementation details so that tests can simulate eventual
consistency behavior in a deterministic way.

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org

View raw message