hbase-issues mailing list archives

From "Vladimir Rodionov (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HBASE-14477) Compaction improvements: Generational compaction policy
Date Sun, 27 Sep 2015 04:22:04 GMT

     [ https://issues.apache.org/jira/browse/HBASE-14477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vladimir Rodionov updated HBASE-14477:
    Summary: Compaction improvements: Generational compaction policy  (was: Compaction improvements:
generational compaction)

> Compaction improvements: Generational compaction policy
> -------------------------------------------------------
>                 Key: HBASE-14477
>                 URL: https://issues.apache.org/jira/browse/HBASE-14477
>             Project: HBase
>          Issue Type: New Feature
>            Reporter: Vladimir Rodionov
>            Assignee: Vladimir Rodionov
>             Fix For: 2.0.0
> For immutable and mostly immutable data, the current size-tiered compaction policy is not efficient:
> # There is no need to compact all files into one, because the data is (mostly) immutable and there is no garbage to collect. (The performance reasons are discussed later.)
> # Size-tiered compaction is not suitable for applications where the most recent data is the most important, and it prevents efficient caching of that data.
> The idea of the generational compaction policy is quite similar to date-tiered compaction (DateTieredCompactionStrategy) in Cassandra:
> # Memstore flushes create Gen0 files.
> # Only store files of the same generation can be compacted together.
> # Once the number of files in GenK reaches N (default 5), they are compacted into a single Gen(K+1) file.
> # Compaction stops at a predefined generation M (default 3).
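The four rules above can be sketched as a small simulation. This is only an illustration, not the policy implementation from the patch: the function names `flush` and `simulate` are hypothetical, and the sketch assumes a compacted file is exactly the sum of its inputs (fully immutable data, nothing dropped).

```python
def flush(generations, size_mb, n=5, max_gen=3):
    """Add one flushed file to Gen0, then cascade compactions upward.

    generations[k] is the list of file sizes (MB) currently in GenK.
    Assumption: a compaction output is the plain sum of its inputs.
    """
    generations[0].append(size_mb)
    # Generations 0 .. max_gen-1 may be compacted; Gen(max_gen) never is.
    for k in range(max_gen):
        if len(generations[k]) >= n:
            merged = sum(generations[k])
            generations[k] = []
            generations[k + 1].append(merged)
    return generations


def simulate(num_flushes, flush_mb=30, n=5, max_gen=3):
    """Run num_flushes memstore flushes and return the files per generation."""
    generations = [[] for _ in range(max_gen + 1)]
    for _ in range(num_flushes):
        flush(generations, flush_mb, n, max_gen)
    return generations


if __name__ == "__main__":
    for flushes in (4, 5, 25, 125):
        print(flushes, simulate(flushes))
```

With the defaults, four flushes leave four 30MB files in Gen0, and the fifth flush triggers a compaction into a single Gen1 file. Files that reach Gen3 simply accumulate there.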
> Simple math: for the sake of simplicity, assume the flush size is 30MB.
> Gen0: 4 * 30MB = 120MB
> Gen1: 4 * 120MB = 480MB
> Gen2: 4 * 480MB = 1.92GB
> Gen3: R * 1.92GB (Gen3 is not compacted by default)
> With 3-4 files in Gen3, the total region size is 10-12GB, of which 10-20% (Gen0, Gen1 and most of Gen2) can be kept in the block cache.
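The arithmetic above can be reproduced with a few lines of Python. This is a sketch under the message's own simplifications: a multiplier of 4 per generation and a Gen3 file size of 1.92GB are taken from the figures as written, and the function names are hypothetical.

```python
def generation_capacity_mb(flush_mb=30, files_per_gen=4, compacted_gens=3):
    """Total size (MB) held by Gen0, Gen1, ... using the message's multiplier."""
    sizes = []
    total = flush_mb
    for _ in range(compacted_gens):
        total *= files_per_gen
        sizes.append(total)
    return sizes


def region_size_mb(gen3_files, flush_mb=30, files_per_gen=4):
    """Region size (MB) with gen3_files files in Gen3, each sized like all of Gen2."""
    sizes = generation_capacity_mb(flush_mb, files_per_gen)
    return sum(sizes) + gen3_files * sizes[-1]


if __name__ == "__main__":
    sizes = generation_capacity_mb()
    print("Gen0..Gen2 (MB):", sizes)
    for r in (3, 4):
        total = region_size_mb(r)
        young = sizes[0] + sizes[1]
        print(f"R={r}: total {total}MB, Gen0+Gen1 share {young / total:.1%}, "
              f"Gen0..Gen2 share {sum(sizes) / total:.1%}")
```

Under these assumptions the region comes to about 8.3GB for R = 3 and about 10.2GB for R = 4, in the same ballpark as the estimate above. The cacheable share depends on how much of Gen2 is counted.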
> Generational compaction does not limit region size; one can use 100GB or even more, because total compaction IO per region can be bounded and, generally speaking, does not depend explicitly on region size (as it does in the size-tiered compaction policy).
> Now, about performance implications:
> SSD-based servers will benefit from this policy because they provide more than adequate random IO, but even HDD-based systems can use it. Again, simple math: with a region size of ~10GB we will have ~16 files, of which 10-12 can be cached in the block cache. Even if a request touches all the files (spans the entire time range), it will need to access only 4-6 uncached files. How to always keep recent data in the block cache is a totally separate topic (JIRA).

This message was sent by Atlassian JIRA
