hbase-issues mailing list archives

From "Dave Latham (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-15454) Archive store files older than max age
Date Tue, 12 Apr 2016 15:40:25 GMT

    [ https://issues.apache.org/jira/browse/HBASE-15454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15237401#comment-15237401 ]

Dave Latham commented on HBASE-15454:

Sorry, I think I'm a bit behind Duo and Clara, but I'm still trying to get the picture of
this right in my head.

So it sounds like EC is independent of this work and can be noted purely as motivation to
have large, infrequently accessed files.

* What does it actually mean to "archive" a store file?  Is there a definition, or set of
properties or guarantees?
** Are archived files excluded from major compaction?  Or minor compactions?  Or from region
split size calculation?
** Are archived files guaranteed to have no timestamp overlap with other HFiles?  Or just
other archived HFiles?
** Or does it just refer to any files with max timestamp older than maxAge?
* Should archiving be a separate modality with a separate method or just happen as part of
compaction with the given window schedule?

{quote}I find the first and last files that overlapping with current archive window, and then
compact all files between them. These makes sure that all data belongs to this window are
contained in the output file.{quote}

Is that first and last file ordered by sequence id?  With max timestamp in current archive
window?  What if the seq id ordering and timestamp overlapping don't match up?
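To make the question concrete, here is a minimal sketch of one reading of the quoted selection rule. It is not taken from the patch: FileInfo, overlaps and select are hypothetical names, the window is assumed to be half-open [start, end), and files are assumed to be ordered by sequence id.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class ArchiveSelectionSketch {
    /** Hypothetical stand-in for a store file: a sequence id plus the timestamp range it covers. */
    static final class FileInfo {
        final long seqId, minTs, maxTs;
        FileInfo(long seqId, long minTs, long maxTs) {
            this.seqId = seqId;
            this.minTs = minTs;
            this.maxTs = maxTs;
        }
        /** True if [minTs, maxTs] intersects the half-open window [start, end). */
        boolean overlaps(long start, long end) {
            return minTs < end && maxTs >= start;
        }
    }

    /** Returns every file from the first to the last window-overlapping file, in seq id order. */
    static List<FileInfo> select(List<FileInfo> files, long start, long end) {
        List<FileInfo> sorted = new ArrayList<>(files);
        sorted.sort(Comparator.comparingLong((FileInfo f) -> f.seqId));
        int first = -1, last = -1;
        for (int i = 0; i < sorted.size(); i++) {
            if (sorted.get(i).overlaps(start, end)) {
                if (first < 0) first = i;
                last = i;
            }
        }
        if (first < 0) return new ArrayList<>();
        return new ArrayList<>(sorted.subList(first, last + 1));
    }

    public static void main(String[] args) {
        // Archive window [100, 200). File 2 holds only newer data, but it sits
        // between files 1 and 3 in seq id order, so it is selected as well.
        List<FileInfo> files = Arrays.asList(
            new FileInfo(1, 100, 150),
            new FileInfo(2, 300, 400),
            new FileInfo(3, 120, 180));
        for (FileInfo f : select(files, 100, 200)) {
            System.out.println("seqId=" + f.seqId + " overlaps=" + f.overlaps(100, 200));
        }
    }
}
```

Under these assumptions, a file that has no data in the window can still be pulled into the compaction whenever seq id order and timestamp order disagree, which is the case I'm asking about.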

I do suspect that it's most efficient to have all windows and tiers in alignment - that if
one desires calendar based files for the archive, that one would be better off using calendar
derived windows for all the data.  For example, if you want the highest tier to be calendar
years, then lower tiers could be 3-month quarters, months, weekOfMonth (some week windows
would not be full 7 days but that should be ok), days, 6-hour blocks.  

Otherwise, the transition of files from one scheme to another seems likely to require splitting
existing data from a file into multiple windows.  Maybe that's OK.
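As a rough illustration of what aligned calendar tiers could look like, the sketch below maps a timestamp to the start of its window at each tier. Everything in it is an assumption for illustration: the Tier enum and windowStart are hypothetical names, all boundaries are computed in UTC, and weekOfMonth is taken to mean days 1-7, 8-14, 15-21, 22-28 and a short final window. It uses java.time, so it assumes Java 8.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.temporal.ChronoUnit;

final class CalendarTiersSketch {
    /** Hypothetical tiers, from the largest window to the smallest. */
    enum Tier { YEAR, QUARTER, MONTH, WEEK_OF_MONTH, DAY, SIX_HOURS }

    /** Returns the start (epoch millis, UTC) of the window of the given tier containing the timestamp. */
    static long windowStart(Tier tier, long epochMillis) {
        ZonedDateTime t = Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC);
        ZonedDateTime day = t.truncatedTo(ChronoUnit.DAYS);
        ZonedDateTime start;
        switch (tier) {
            case YEAR:
                start = day.withDayOfYear(1);
                break;
            case QUARTER:
                // Move to day 1 first so changing the month never clamps the day.
                start = day.withDayOfMonth(1).withMonth(((t.getMonthValue() - 1) / 3) * 3 + 1);
                break;
            case MONTH:
                start = day.withDayOfMonth(1);
                break;
            case WEEK_OF_MONTH:
                // Weeks restart on day 1 of each month; the last one may be shorter than 7 days.
                start = day.withDayOfMonth(((t.getDayOfMonth() - 1) / 7) * 7 + 1);
                break;
            case DAY:
                start = day;
                break;
            default: // SIX_HOURS
                start = day.withHour((t.getHour() / 6) * 6);
                break;
        }
        return start.toInstant().toEpochMilli();
    }

    public static void main(String[] args) {
        long ts = Instant.parse("2016-04-12T15:40:25Z").toEpochMilli();
        for (Tier tier : Tier.values()) {
            System.out.println(tier + " -> " + Instant.ofEpochMilli(windowStart(tier, ts)));
        }
    }
}
```

With these definitions every window sits entirely inside one window of the next tier up, so promoting files to a higher tier should never require splitting a file across two windows.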

Taking a quick look over Duo's github link there: I like how there is a pluggable window factory.
 I think if we have that we should try to move the window specific configuration out of the
generic CompactionConfiguration into the specific window factory.  Also, I'm not sure if the
intent is for ExponentialThenCalendricalCompactionWindowFactory to be in the hbase code or
it's just there as an illustration of an alternate plugin - I tend to think it should not
be included by default.

As a side note, it seems unfortunate to add joda time as a full dependency when most people
probably won't use tiered compaction, let alone calendar based windows / archives.  Perhaps
using JDK classes would suffice or even direct basic logic in the code?  Or if it's just included
with a window factory plugin then only people using that would need it.
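For what it's worth, here is a minimal sketch of the JDK-only route, assuming calendar-year windows in UTC. The class and method names are hypothetical and nothing here comes from the patch. It sticks to java.util.Calendar so it would also work on a branch that cannot rely on java.time.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;

final class JdkYearWindowSketch {
    /** Returns {start, end} in epoch millis of the UTC calendar year containing the timestamp; end is exclusive. */
    static long[] yearWindow(long epochMillis) {
        Calendar c = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        c.setTimeInMillis(epochMillis);
        int year = c.get(Calendar.YEAR);
        c.clear(); // resets every field, including milliseconds; the time zone is kept
        c.set(year, Calendar.JANUARY, 1, 0, 0, 0);
        long start = c.getTimeInMillis();
        c.add(Calendar.YEAR, 1);
        long end = c.getTimeInMillis();
        return new long[] { start, end };
    }

    public static void main(String[] args) {
        long[] w = yearWindow(1460475625000L); // 2016-04-12T15:40:25Z
        System.out.println(w[0] + " .. " + w[1]);
    }
}
```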

> Archive store files older than max age
> --------------------------------------
>                 Key: HBASE-15454
>                 URL: https://issues.apache.org/jira/browse/HBASE-15454
>             Project: HBase
>          Issue Type: Sub-task
>          Components: Compaction
>    Affects Versions: 2.0.0, 1.3.0, 0.98.18, 1.4.0
>            Reporter: Duo Zhang
>            Assignee: Duo Zhang
>             Fix For: 2.0.0, 1.3.0, 0.98.19, 1.4.0
>         Attachments: HBASE-15454-v1.patch, HBASE-15454.patch
> Sometimes the old data is rarely touched but we can not remove it. So archive it to several
> big files (by year or something) and use EC to reduce the redundancy.

This message was sent by Atlassian JIRA