hadoop-common-issues mailing list archives

From "Aaron Fabbri (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HADOOP-13449) S3Guard: Implement DynamoDBMetadataStore.
Date Thu, 27 Oct 2016 21:43:59 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-13449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15613313#comment-15613313
] 

Aaron Fabbri edited comment on HADOOP-13449 at 10/27/16 9:43 PM:
-----------------------------------------------------------------

Exciting stuff, thanks for the update.

{quote}
I changed the base unit test as the owner, group and permission etc are not part of the metadata
we're interested in by now.
{quote}

Good. We could have a helper function that all tests could use, e.g. doesMetadataStorePersistOwnerGroupPermission(),
which returns false if the MetadataStore is an instance of DynamoDBMetadataStore.  This is another
spot where it might be nice to add a {{getProperty()}} function to MetadataStore, so we could call {{getProperty(PERSISTS_PERMISSIONS)}}
etc.  We could do that later on.
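
As a rough illustration of the capability-query idea, here is a minimal sketch. Everything in it is an assumption made for illustration: {{StoreProperty}}, {{CapabilityAwareStore}} and the two stand-in store classes are hypothetical and are not part of the real MetadataStore interface; only the helper name and {{PERSISTS_PERMISSIONS}} come from the text above.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical capability flags; not part of the real MetadataStore API. */
enum StoreProperty { PERSISTS_PERMISSIONS, PERSISTS_OWNER_GROUP }

/** Stand-in for MetadataStore, reduced to the proposed capability query. */
interface CapabilityAwareStore {
  boolean getProperty(StoreProperty property);
}

/** Stand-in for a store that keeps every FileStatus field. */
final class FullStatusStore implements CapabilityAwareStore {
  @Override
  public boolean getProperty(StoreProperty property) {
    return true;
  }
}

/** Stand-in for a store that persists none of the optional fields. */
final class ReducedStatusStore implements CapabilityAwareStore {
  private final Set<StoreProperty> supported =
      EnumSet.noneOf(StoreProperty.class);

  @Override
  public boolean getProperty(StoreProperty property) {
    return supported.contains(property);
  }
}

final class MetadataStoreTestHelper {
  private MetadataStoreTestHelper() {
  }

  /** Tests ask the store what it persists instead of using instanceof. */
  static boolean doesMetadataStorePersistOwnerGroupPermission(
      CapabilityAwareStore store) {
    return store.getProperty(StoreProperty.PERSISTS_PERMISSIONS)
        && store.getProperty(StoreProperty.PERSISTS_OWNER_GROUP);
  }
}
```

With something like this, a base test could skip its owner/group/permission assertions whenever the helper returns false, without naming any concrete store class.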

{quote}
We store the is_empty for directory in the DynamoDB (DDB) metadata store now. We have to update
this information in a consistent and efficient way. We don't want to check the parent directory
every time we delete/put a file item. At least we can optimize this when deleting a subtree.
{quote}
This part is a pain.  We should revisit the whole {{S3AFileStatus#isEmptyDirectory}} idea
in the future. 

In case it helps, my algorithm is here:

In put(PathMetadata meta):
{code}
  if we have PathMetadata for meta's parent path:
      parentMeta.setIsEmpty(false)
{code}
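
The put() rule above could be sketched roughly as follows. This is a simplified in-memory model, not the actual store code: {{SimplePathMetadata}}, {{PutSketch}} and {{parentOf()}} are hypothetical names, and paths are plain strings rather than Hadoop Path objects.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified stand-in for PathMetadata: a path plus a cached is-empty flag. */
final class SimplePathMetadata {
  final String path;
  boolean isEmpty;

  SimplePathMetadata(String path, boolean isEmpty) {
    this.path = path;
    this.isEmpty = isEmpty;
  }
}

final class PutSketch {
  private final Map<String, SimplePathMetadata> entries = new HashMap<>();

  /** Parent of an absolute path; the root is treated as its own parent. */
  static String parentOf(String path) {
    int slash = path.lastIndexOf('/');
    return slash <= 0 ? "/" : path.substring(0, slash);
  }

  void put(SimplePathMetadata meta) {
    entries.put(meta.path, meta);
    // Only touch the parent if we already hold metadata for it;
    // no extra lookup against the backing store is made.
    SimplePathMetadata parentMeta = entries.get(parentOf(meta.path));
    if (parentMeta != null && parentMeta != meta) {
      parentMeta.isEmpty = false;
    }
  }

  SimplePathMetadata get(String path) {
    return entries.get(path);
  }
}
```

The point of the sketch is that adding a child can only ever flip the parent's flag to false, so no listing of the parent is needed on the put path.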

The harder case, when we are removing an entry:

{code}
// If we have cached a FileStatus for the parent...
DirListingMetadata dir = dirHash.get(parent);
if (dir != null) {
  LOG.debug("removing parent's entry for {}", path);

  // Remove our path from the parent dir
  dir.remove(path);

  // S3A-specific logic dealing with S3AFileStatus#isEmptyDirectory()
  if (isS3A) {
    if (dir.isAuthoritative() && dir.numEntries() == 0) {
      setS3AIsEmpty(parent, true);
    } else if (dir.numEntries() == 0) {
      // We do not know of any remaining entries in parent directory.
      // However, we do not have authoritative listing, so there may
      // still be some entries in the dir.  Since we cannot know the
      // proper state of the parent S3AFileStatus#isEmptyDirectory, we
      // will invalidate our entries for it.
      // Better than deleting entries would be marking them as "missing
      // metadata".  Deleting them means we lose consistent listing and
      // ability to retry for eventual consistency for the parent path.

      // TODO implement missing metadata feature
      invalidateFileStatus(parent);
    }
    // else parent directory still has entries in it, isEmptyDirectory
    // does not change
  }
}
{code}

The loss of consistency on the parent could be fixed by leaving an empty PathMetadata
for the parent, one that does not contain a FileStatus.  That "missing metadata" PathMetadata
would indicate to future getFileStatus() or listStatus() calls that the path does exist (so
retry if S3 is eventually consistent), but that the FileStatus needs to be recreated (the regular
getFileStatus() logic), since we cannot know the value of its isEmptyDirectory().

I added a TODO because we can tackle this later if we want.
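
To make the "missing metadata" idea a bit more concrete, here is a minimal sketch of how such an entry might be told apart from an unknown path. All names ({{MissingMetadataSketch}}, {{Lookup}}, {{markStatusMissing()}}) are hypothetical, and a String stands in for the FileStatus.

```java
import java.util.HashMap;
import java.util.Map;

final class MissingMetadataSketch {
  /** The three answers a lookup can give under this scheme. */
  enum Lookup { UNKNOWN, EXISTS_STATUS_MISSING, EXISTS_WITH_STATUS }

  /** Hypothetical entry; a null status means "exists, status must be rebuilt". */
  private static final class Entry {
    final String status;

    Entry(String status) {
      this.status = status;
    }
  }

  private final Map<String, Entry> entries = new HashMap<>();

  void putStatus(String path, String status) {
    entries.put(path, new Entry(status));
  }

  /** Used instead of deleting the entry when isEmptyDirectory is unknowable. */
  void markStatusMissing(String path) {
    entries.put(path, new Entry(null));
  }

  Lookup lookup(String path) {
    Entry entry = entries.get(path);
    if (entry == null) {
      return Lookup.UNKNOWN;
    }
    return entry.status == null
        ? Lookup.EXISTS_STATUS_MISSING
        : Lookup.EXISTS_WITH_STATUS;
  }
}
```

A caller that sees EXISTS_STATUS_MISSING would know the path should exist, so it could retry against S3 on a miss, and would rebuild the FileStatus through the regular getFileStatus() logic.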

{quote}The contract assumes we create the direct parent directory (other ancestors should
be taken care of by the clients/callers) when putting a new file item{quote}

Yeah, this is for consistent listing on the parent after the child is created.  I'm wondering
whether we can relax this or make it configurable.  When {{fs.s3a.metadatastore.authoritative}}
is true, the performance hit on create could be offset by a performance gain on subsequent
listing of the parent directory.
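
One possible reading of "make it configurable", sketched under assumptions: the parent entry is written on put only when the authoritative flag is set. {{ParentEntryPolicy}} and {{shouldCreateParentEntry()}} are hypothetical, and java.util.Properties stands in for the Hadoop Configuration object; only the key name comes from the text above.

```java
import java.util.Properties;

final class ParentEntryPolicy {
  /** Configuration key quoted in the comment above. */
  static final String AUTHORITATIVE_KEY = "fs.s3a.metadatastore.authoritative";

  private ParentEntryPolicy() {
  }

  /**
   * Assumed policy: pay for the parent write on create only when the
   * store is authoritative, since that is when listings can benefit.
   */
  static boolean shouldCreateParentEntry(Properties conf) {
    return Boolean.parseBoolean(conf.getProperty(AUTHORITATIVE_KEY, "false"));
  }
}
```

Whether the default should be on or off is exactly the open question; the sketch simply defaults to off.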

Looks like good progress! Please shout if I can help at all.



> S3Guard: Implement DynamoDBMetadataStore.
> -----------------------------------------
>
>                 Key: HADOOP-13449
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13449
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>            Reporter: Chris Nauroth
>            Assignee: Mingliang Liu
>         Attachments: HADOOP-13449-HADOOP-13345.000.patch, HADOOP-13449-HADOOP-13345.001.patch
>
>
> Provide an implementation of the metadata store backed by DynamoDB.


