hadoop-common-issues mailing list archives

From "Sean Mackrory (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-14041) CLI command to prune old metadata
Date Fri, 03 Feb 2017 00:42:51 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-14041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15850820#comment-15850820 ]

Sean Mackrory commented on HADOOP-14041:

Been thinking about it some more, and cleaning up directories is very tricky. One problem is
that we don't put a mod_time on directories (presumably just because S3 doesn't?), so it's
impossible to distinguish between a directory that has existed for a long time and has had
all of its contents pruned, and a directory that was just created recently and had no contents
to prune (yet). Putting a mod_time on a directory could be done in 2 ways: we could use it
as a creation time, or as the time when its list of children last changed. If it's only used
for deciding when to prune old metadata, using it as a creation time allows us to clean very
old directories that don't have more recent children, without the overhead of updating it
every time we add or modify a child. But that might be a bit of a departure from the meaning
expressed by "modification time".
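Just to make the ambiguity concrete, here's a rough sketch of the prune decision under the creation-time variant. All the names here are made up for illustration, not anything from the S3Guard code: the point is that without a recorded time, an empty directory can't safely be pruned, because it may have been created moments ago.

```java
// Hypothetical sketch: deciding whether a directory entry is prunable when a
// timestamp is recorded at creation time. Names are illustrative only.
public class DirPruneCheck {
    /**
     * A directory is prunable only if it is empty and its recorded time
     * (creation time in this variant) is older than the cutoff. With no
     * recorded time at all, an old fully-pruned directory and a brand-new
     * empty one look identical, so the entry must be kept.
     */
    public static boolean isPrunable(Long recordedTimeMs, boolean isEmpty,
                                     long cutoffMs) {
        if (!isEmpty) {
            return false;       // never prune a directory that still has entries
        }
        if (recordedTimeMs == null) {
            return false;       // no timestamp: could be brand new; keep it
        }
        return recordedTimeMs < cutoffMs;
    }
}
```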

I'm thinking a couple of things:

1) For now, I think I'll just prune directories that did have contents but are now completely
empty post-prune. Later, maybe we can add mod_time for directories and also clean up directories
that are old enough to be pruned and are empty even though they didn't have children removed
in this prune. The more I think about it, the more I think that case will be rare and not
worth adding mod_time to all directories just to clean up more nicely.

2) Having thought about the gap between identifying the files to prune and then the directories
to prune, it's probably better to do this in very small batches. It's okay for this prune
command to take longer to run because we're making many round trips. The benefit is that we
minimize the window in which files can get created in a directory that is being cleaned up
and might be considered empty. It also minimizes the impact on other workloads.
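Something like this sketch is what I have in mind for the batching part; `toBatches` and the batch size are illustrative, not anything from the patch:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: splitting the expired paths into small batches so
// the store can be re-queried between deletes. Names are hypothetical.
public class BatchedPruner {
    /** Split the expired paths into batches of at most batchSize entries. */
    public static List<List<String>> toBatches(List<String> expiredPaths,
                                               int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < expiredPaths.size(); i += batchSize) {
            int end = Math.min(i + batchSize, expiredPaths.size());
            batches.add(new ArrayList<>(expiredPaths.subList(i, end)));
        }
        return batches;
    }
}
```

Deleting one small batch per round trip (even batches of one) keeps each "is this directory now empty?" check close in time to the deletes that could have emptied it, at the cost of a longer total runtime.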

So ultimately I'm thinking the best way to do this is to clean up directories that did have
children but had them all pruned (and THEIR parents, if the same is now true of the parent
directory), and to do this in very small batches or even individually. The more I think about
it, the less I think it's worth adding mod_time to directories to handle this any more completely.
Would love to hear others' input, though.
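The walk up through parents can be sketched with an in-memory map standing in for the MetadataStore. Everything here (the map, the method names) is hypothetical, just to show the stopping condition: we keep removing newly-empty parents until we hit one that still has entries.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory sketch of "clean up parents whose children were all
// pruned, and their parents in turn". A Map of directory path -> child names
// stands in for the MetadataStore; none of this is from the actual patch.
public class ParentCleanup {
    /**
     * Remove the pruned path from its parent's child set. If the parent had
     * children before and is now empty, prune it too and repeat for the
     * grandparent; stop at the first ancestor that still has entries.
     */
    public static void pruneAndCleanParents(Map<String, Set<String>> dirs,
                                            String prunedPath) {
        String child = prunedPath;
        String parent = parentOf(child);
        while (parent != null && dirs.containsKey(parent)) {
            Set<String> children = dirs.get(parent);
            children.remove(nameOf(child));
            if (!children.isEmpty()) {
                break;              // parent still has entries: stop walking up
            }
            dirs.remove(parent);    // had children, now empty: prune it too
            child = parent;
            parent = parentOf(parent);
        }
    }

    private static String parentOf(String p) {
        int i = p.lastIndexOf('/');
        return i <= 0 ? null : p.substring(0, i);
    }

    private static String nameOf(String p) {
        return p.substring(p.lastIndexOf('/') + 1);
    }
}
```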

> CLI command to prune old metadata
> ---------------------------------
>                 Key: HADOOP-14041
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14041
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>            Reporter: Sean Mackrory
>            Assignee: Sean Mackrory
>         Attachments: HADOOP-14041-HADOOP-13345.001.patch
> Add a CLI command that allows users to specify an age at which to prune metadata that
> hasn't been modified for an extended period of time. Since the primary use-case targeted at
> the moment is list consistency, it would make sense (especially when authoritative=false)
> to prune metadata that is expected to have become consistent a long time ago.

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org
