hbase-issues mailing list archives

From "Clara Xiong (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-15400) Using Multiple Output for Date Tiered Compaction
Date Sun, 06 Mar 2016 16:05:40 GMT

    [ https://issues.apache.org/jira/browse/HBASE-15400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15182211#comment-15182211
] 

Clara Xiong commented on HBASE-15400:
-------------------------------------

Sorry if I didn't make it clear. Suppose a user loads a single bulk load file with a large time
span into an empty store, with seqId 0 or 1 or some very small number, and then calls major
compaction to output a date tiered layout. With the default window size, we would output more
files than the existing seq id. Another extreme case is a user whose bulk load files get seq id
-1 because they explicitly turned off the configuration that assigns sequence ids. One more
extreme case is a user who specifies timestamps for business logic that cover an extremely
large time range, which requires more output files than the seq id of the first file written
to an empty store.
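
To make the first case concrete, here is a rough sketch of the arithmetic. It is not code from
the patch: the class and method names are hypothetical, the 6-hour window and 30-day span are
made-up numbers, and it assumes fixed-size windows aligned to multiples of the window size,
which simplifies the real tiered layout.

```java
public class WindowCountSketch {
    // Number of fixed-size windows touched by the inclusive range [minTs, maxTs].
    // Assumption: windows are aligned to multiples of windowMillis.
    static long windowsCovered(long minTs, long maxTs, long windowMillis) {
        return Math.floorDiv(maxTs, windowMillis) - Math.floorDiv(minTs, windowMillis) + 1;
    }

    public static void main(String[] args) {
        long windowMillis = 6L * 60 * 60 * 1000;     // hypothetical window size: 6 hours
        long minTs = 0L;                             // oldest cell in the bulk load file
        long maxTs = 30L * 24 * 60 * 60 * 1000 - 1;  // newest cell: just under 30 days later
        long seqId = 1L;                             // seq id of the single bulk load file

        long outputs = windowsCovered(minTs, maxTs, windowMillis);
        // One output file per window would need far more files than the seq id value.
        System.out.println(outputs + " output files vs seq id " + seqId);
    }
}
```

With these numbers a single file with seq id 1 spans 120 windows.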


> Using Multiple Output for Date Tiered Compaction
> ------------------------------------------------
>
>                 Key: HBASE-15400
>                 URL: https://issues.apache.org/jira/browse/HBASE-15400
>             Project: HBase
>          Issue Type: Sub-task
>          Components: Compaction
>            Reporter: Clara Xiong
>            Assignee: Clara Xiong
>             Fix For: 2.0.0
>
>         Attachments: HBASE-15400.patch
>
>
> When we compact, we can output multiple files along the current window boundaries. There
> are two use cases:
> 1. Major compaction: We want to output date tiered store files.
> 2. Bulk load files and the old file generated by major compaction before upgrading to
> DTCP.
> Pros: 
> 1. Restore locality, process versioning, updates and deletes while maintaining the tiered
> layout.
> 2. The best way to fix a skewed layout.
>  
> This work is based on a prototype of the date tiered file writer from HBASE-15389. I have
> to call out a few design decisions:
> 1. We only want to output files along all windows for major compaction. For data older
> than max age, we want to output multiple files, each sized to the maximum tier window size,
> which is determined by base window size, windows per tier and max age.
> 2. For minor compaction, we don't want to output too many files, because they will remain
> around due to the current restriction of contiguous compaction by seq id. I will output only
> two files if all the files in the window are being combined: one for the data within the
> window and the other for the out-of-window tail. If any file in the window is excluded from
> compaction, only one file will be output. As the windows are promoted, the situation with
> out-of-order data will gradually improve.
> 3. We have to pass the boundaries together with the list of store files as a complete time
> snapshot instead of in two separate calls, because the window layout is determined by the
> time at which the computation is called. So we will need a new type of compaction request.
> 4. Since we will assign the same seq id to all output files, we need to sort by maxTimestamp
> as a secondary criterion. Right now all compaction policies get the files sorted by
> StoreFileManager, which sorts by seq id and other criteria. I will use this order for DTCP
> only, to avoid impacting other compaction policies.
> 5. We need some cleanup of the current design of StoreEngine and CompactionPolicy.
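
The ordering described in design decision 4 might be sketched as below. This is only an
illustration under stated assumptions: FileInfo is a hypothetical stand-in, not HBase's
StoreFile, and the comparator ignores the "other criteria" that StoreFileManager applies
today. It shows why maxTimestamp is needed as a tie-breaker once several output files share
one seq id.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SeqIdThenMaxTsSketch {
    // Hypothetical stand-in for a store file; not an HBase class.
    static final class FileInfo {
        final String name;
        final long seqId;
        final long maxTimestamp;

        FileInfo(String name, long seqId, long maxTimestamp) {
            this.name = name;
            this.seqId = seqId;
            this.maxTimestamp = maxTimestamp;
        }
    }

    // Order by seq id first; files sharing a seq id are ordered by max timestamp.
    static final Comparator<FileInfo> SEQ_ID_THEN_MAX_TS =
        Comparator.comparingLong((FileInfo f) -> f.seqId)
                  .thenComparingLong(f -> f.maxTimestamp);

    // Returns the file names in sorted order without modifying the input list.
    static List<String> sortedNames(List<FileInfo> files) {
        List<FileInfo> copy = new ArrayList<>(files);
        copy.sort(SEQ_ID_THEN_MAX_TS);
        List<String> names = new ArrayList<>();
        for (FileInfo f : copy) {
            names.add(f.name);
        }
        return names;
    }

    public static void main(String[] args) {
        List<FileInfo> files = new ArrayList<>();
        // Three outputs of one compaction share seq id 10; one newer flush has seq id 11.
        files.add(new FileInfo("window-new", 10L, 3000L));
        files.add(new FileInfo("flush", 11L, 3500L));
        files.add(new FileInfo("window-old", 10L, 1000L));
        files.add(new FileInfo("window-mid", 10L, 2000L));
        System.out.println(sortedNames(files));
    }
}
```

Sorting by seq id alone would leave the relative order of the three seq id 10 files
undefined, which is the gap the secondary sort closes.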



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
