hadoop-common-issues mailing list archives

From "Lei (Eddy) Xu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-13449) S3Guard: Implement DynamoDBMetadataStore.
Date Thu, 27 Oct 2016 22:12:58 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-13449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15613426#comment-15613426
] 

Lei (Eddy) Xu commented on HADOOP-13449:
----------------------------------------

Good discussion, [~liuml07] and [~fabbri]

bq. The contract assumes we create the direct parent directory (other ancestors should be
taken care of by the clients/callers) when putting a new file item. I checked the in-memory
local metadata store and it implements this idea. This may be not efficient to DDB. Basically
for putting X items, we have to issue 2X~3X DDB requests (X for putting file, X for checking
its parent directories, and possible X for updating its parent directories). I'm wondering
if we can also let the client/caller pre-create the direct parent directory as other ancestors.

I suggest considering this in two aspects:
* Checking parent directories in normal {{S3AFileSystem}} operations (i.e., create / mkdirs).
In this case, {{S3AFileSystem}} should already ensure the invariant of the contract (the
parent directories exist before {{S3AFileSystem}} starts to create files on S3).
* Loading files and directories outside of normal {{S3AFileSystem}} operations, e.g., loading
a *non-cached* directory or running the CLI tool. In such cases, would a small local "dentry_cache"
type of data structure be sufficient for a batch operation? These operations can rely on
the namespace structure already existing on S3.
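
To make the "dentry_cache" idea concrete, here is a rough sketch of what I have in mind. The class and method names ({{DentryCache}}, {{parentOf}}, {{firstSight}}) are made up for illustration and are not part of any patch; paths are assumed to be absolute with no trailing slash. A batch loader would only check or put a parent directory in the metadata store the first time the cache sees it.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: bounded LRU set of directories already known to exist. */
final class DentryCache {
    private final Map<String, Boolean> known;

    DentryCache(final int capacity) {
        // accessOrder=true keeps the least recently used entry first,
        // so removeEldestEntry evicts the LRU directory once over capacity.
        this.known = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > capacity;
            }
        };
    }

    /** Direct parent of an absolute path, or null for "/" itself. */
    static String parentOf(String path) {
        if (path.equals("/")) {
            return null;
        }
        int slash = path.lastIndexOf('/');
        return slash == 0 ? "/" : path.substring(0, slash);
    }

    /**
     * Records the directory as seen. Returns true only when it was not
     * cached, which is when the caller still has to check or put it.
     */
    boolean firstSight(String dir) {
        return known.put(dir, Boolean.TRUE) == null;
    }
}
```

With something like this, loading X files that share a handful of directories costs roughly X puts plus one check per distinct directory, instead of one check per file.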

The last resort: if {{S3AFileSystem}} considers it safe to {{create / mkdir}} on a path,
you can always create all of its parent directories in a single batch to DynamoDB. In
short, I'd suggest letting {{S3AFileSystem}} ensure the contract.
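
A minimal sketch of the "single batch" fallback, assuming absolute paths with no trailing slash. {{AncestorBatch}}, {{ancestorsOf}} and {{partition}} are hypothetical helper names, and the actual DynamoDB write is left out. The partition step is there because DynamoDB's BatchWriteItem accepts at most 25 put/delete requests per call.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helpers for putting a path's ancestors in one batch. */
final class AncestorBatch {
    /**
     * Every ancestor directory of an absolute path, ordered from the
     * top-level directory down to the direct parent. Root is skipped.
     */
    static List<String> ancestorsOf(String path) {
        List<String> dirs = new ArrayList<>();
        int slash = path.lastIndexOf('/');
        while (slash > 0) {
            path = path.substring(0, slash);
            dirs.add(path);
            slash = path.lastIndexOf('/');
        }
        Collections.reverse(dirs);
        return dirs;
    }

    /** Splits items into consecutive groups of at most the given size. */
    static <T> List<List<T>> partition(List<T> items, int size) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < items.size(); i += size) {
            int end = Math.min(i + size, items.size());
            batches.add(new ArrayList<>(items.subList(i, end)));
        }
        return batches;
    }
}
```

For a file at {{/a/b/c/file.txt}}, {{ancestorsOf}} yields {{/a}}, {{/a/b}}, {{/a/b/c}}; those entries plus the file item itself would normally fit in a single batch request.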

bq. We store the is_empty for directory in the DynamoDB (DDB) metadata store now. We have
to update this information in a consistent and efficient way. We don't want to check the parent
directory every time we delete/put a file item. At least we can optimize this when deleting
a subtree.

Another way to do it is to have the {{isEmpty()}} flag set by issuing a small _additional_
query on the directory with a [Limit=1|http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/QueryAndScan.html#ScanQueryLimit].
If the query returns a result, the {{isEmpty}} flag is false; otherwise, the flag is
true. This value can be cached for the lifetime of the {{S3AFileStatus}}, as it cannot reliably
reflect the changes in S3 anyway. So the query cost is only paid the first time you call
{{isEmpty()}}, and you don't need to update this flag on any S3 write.
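
Roughly, the lazy flag could look like the sketch below. {{LazyEmptyFlag}} is a hypothetical name, and the {{BiFunction}} stands in for the real DynamoDB query (directory and limit in, child paths out), so nothing here is the actual store API. One caveat: Limit caps the items evaluated before any filter expression is applied, so this only works if the key condition alone selects the children.

```java
import java.util.List;
import java.util.function.BiFunction;

/** Hypothetical sketch: emptiness resolved by one Limit=1 query, then cached. */
final class LazyEmptyFlag {
    private final String dir;
    // Stand-in for the metadata store query: (directory, limit) -> children.
    private final BiFunction<String, Integer, List<String>> childQuery;
    private Boolean empty; // null until the first isEmpty() call

    LazyEmptyFlag(String dir, BiFunction<String, Integer, List<String>> childQuery) {
        this.dir = dir;
        this.childQuery = childQuery;
    }

    /** Issues the query on first use only; later calls reuse the cached answer. */
    boolean isEmpty() {
        if (empty == null) {
            // One child is enough to prove the directory is not empty.
            empty = childQuery.apply(dir, 1).isEmpty();
        }
        return empty;
    }
}
```

The cached answer lives and dies with the status object, so no put or delete ever has to go back and touch the parent's flag.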

Hope that works.

> S3Guard: Implement DynamoDBMetadataStore.
> -----------------------------------------
>
>                 Key: HADOOP-13449
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13449
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>            Reporter: Chris Nauroth
>            Assignee: Mingliang Liu
>         Attachments: HADOOP-13449-HADOOP-13345.000.patch, HADOOP-13449-HADOOP-13345.001.patch
>
>
> Provide an implementation of the metadata store backed by DynamoDB.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org

