hadoop-common-issues mailing list archives

From "Sean Mackrory (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HADOOP-14041) CLI command to prune old metadata
Date Fri, 17 Feb 2017 18:21:41 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-14041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Sean Mackrory updated HADOOP-14041:
    Attachment: HADOOP-14041-HADOOP-13345.006.patch

Thanks for the reviews, all - good stuff.

The problems [~fabbri] saw boil down to two things, one of which I fixed. First, I had not
tested this with anything being inferred from an S3 path, and I wasn't parsing and using the
path the way the other commands do. That is now fixed and covered by the tests. Second, the
command appears not to parse generic options. That does indeed seem wrong: according to the
docs, if you implement Tool you should get that for free, and we do implement it. But the
behavior wouldn't be what you expect anyway, because the command sets the table config based
on the -m flag or the S3 path you provide. I think the CLI behavior is badly defined here in
general, so I've filed HADOOP-14094 to really rethink what options are exposed and how.

I like [~stevel@apache.org]'s recommendation to just throw the IOException. My original
thinking was that if there's an issue deleting one row, we can keep trying the others. But
an exception that affects one row and not the subsequent ones is probably unlikely, so it's
worth bubbling that up so we know about the problem. Also, removing that block highlighted
that my batching logic was bad: instead of processing complete batches inside the loop and
processing whatever is left over afterwards, I was effectively always processing whatever
contents the batch had at the end of each iteration. That's been fixed, and I verified that
the number of events was correct with several hundred objects getting pruned.
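
For readers following along, the corrected batching pattern described above can be sketched
roughly as follows. This is not the code from the patch: the class BatchedPrune, the
BatchDeleter interface, and the method pruneInBatches are hypothetical names made up for
illustration, and the real metadata store calls are replaced by a callback. It only shows
the shape of the loop: flush complete batches inside the loop, flush the remainder once
afterwards, and let any IOException propagate to the caller.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the batching pattern; not taken from the patch. */
class BatchedPrune {

    /** Stand-in for whatever actually deletes a batch of rows (assumed). */
    interface BatchDeleter {
        void delete(List<String> keys) throws IOException;
    }

    /**
     * Deletes the given keys in batches of at most batchSize.
     * Any IOException from the deleter is propagated, not swallowed.
     *
     * @return the total number of keys handed to the deleter
     */
    static int pruneInBatches(Iterable<String> expiredKeys, int batchSize,
                              BatchDeleter deleter) throws IOException {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<String> batch = new ArrayList<>(batchSize);
        int deleted = 0;
        for (String key : expiredKeys) {
            batch.add(key);
            // Only flush inside the loop once the batch is complete.
            if (batch.size() == batchSize) {
                deleter.delete(new ArrayList<>(batch));
                deleted += batch.size();
                batch.clear();
            }
        }
        // Flush whatever is left over, exactly once, after the loop.
        if (!batch.isEmpty()) {
            deleter.delete(new ArrayList<>(batch));
            deleted += batch.size();
        }
        return deleted;
    }

    public static void main(String[] args) throws IOException {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 7; i++) {
            keys.add("key-" + i);
        }
        List<Integer> batchSizes = new ArrayList<>();
        int deleted = pruneInBatches(keys, 3, b -> batchSizes.add(b.size()));
        System.out.println("Deleted " + deleted + " items, batch sizes " + batchSizes);
    }
}
```

With seven keys and a batch size of three, the deleter is called with batches of 3, 3 and 1.
The bug described above corresponds to calling the deleter at the end of every iteration
regardless of how full the batch is.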

On a related note, I also changed the log message to INFO and had it count items and report
batch size rather than just the number of batches. Without that, the last message you get
out-of-the-box on the CLI is that the metastore has been initialized, which is misleading.
It will now log when the metadata store connection has been initialized and then finish off
by logging how many items were deleted and what the batch size was. I think that's more
friendly, and probably something we want to do more of for the other commands if / when we
rethink the interface.

> CLI command to prune old metadata
> ---------------------------------
>                 Key: HADOOP-14041
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14041
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>            Reporter: Sean Mackrory
>            Assignee: Sean Mackrory
>         Attachments: HADOOP-14041-HADOOP-13345.001.patch, HADOOP-14041-HADOOP-13345.002.patch,
> HADOOP-14041-HADOOP-13345.003.patch, HADOOP-14041-HADOOP-13345.004.patch, HADOOP-14041-HADOOP-13345.005.patch,
>
> Add a CLI command that allows users to specify an age at which to prune metadata that
> hasn't been modified for an extended period of time. Since the primary use-case targeted at
> the moment is list consistency, it would make sense (especially when authoritative=false)
> to prune metadata that is expected to have become consistent a long time ago.
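
As a rough illustration of the age-based rule in the description above, the check amounts
to comparing each entry's last-modified time against a cutoff of "now minus age". The
sketch below is an assumption about how such a check could look, not the command's actual
implementation; PruneCutoff, cutoffMillis and isPrunable are hypothetical names.

```java
import java.time.Duration;

/** Hypothetical sketch of an age-based prune check; not taken from the patch. */
class PruneCutoff {

    /** Entries last modified strictly before this timestamp are candidates for pruning. */
    static long cutoffMillis(long nowMillis, Duration age) {
        return nowMillis - age.toMillis();
    }

    /** True if the entry has not been modified since before the cutoff. */
    static boolean isPrunable(long modTimeMillis, long cutoffMillis) {
        return modTimeMillis < cutoffMillis;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long cutoff = cutoffMillis(now, Duration.ofDays(7));
        long eightDaysAgo = now - Duration.ofDays(8).toMillis();
        long oneDayAgo = now - Duration.ofDays(1).toMillis();
        System.out.println("8 days old prunable: " + isPrunable(eightDaysAgo, cutoff));
        System.out.println("1 day old prunable: " + isPrunable(oneDayAgo, cutoff));
    }
}
```

Whether the comparison should be strict, and which timestamp on the entry is used, are
details this sketch assumes rather than takes from the issue.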

This message was sent by Atlassian JIRA

