hbase-issues mailing list archives

From "Vladimir Rodionov (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HBASE-14477) Compaction improvements: Date tiered compaction policy
Date Thu, 01 Oct 2015 21:34:27 GMT

     [ https://issues.apache.org/jira/browse/HBASE-14477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vladimir Rodionov updated HBASE-14477:
--------------------------------------
    Description: 
For immutable and mostly-immutable data the current SizeTiered-based compaction policy is not efficient:

# There is no need to compact all files into one: because the data is (mostly) immutable, there is no garbage to collect (the performance reasons will be discussed later).
# Size-tiered compaction is not suitable for applications where the most recent data is the most important, and it prevents efficient caching of that data.

The idea is pretty similar to DateTieredCompaction in Cassandra:

http://www.datastax.com/dev/blog/datetieredcompactionstrategy
http://www.datastax.com/dev/blog/dtcs-notes-from-the-field

From Cassandra's own blog:

{quote}
Since DTCS can be used with any table, it is important to know when it is a good idea, and
when it is not. I’ll try to explain the spectrum and trade-offs here:

1. Perfect Fit: Time Series Fact Data, Deletes by Default TTL: When you ingest fact data that
is ordered in time, with no deletes or overwrites. This is the standard “time series”
use case.

2. OK Fit: Time-Ordered, with limited updates across whole data set, or only updates to recent
data: When you ingest data that is (mostly) ordered in time, but revise or delete a very small
proportion of the overall data across the whole timeline.

3. Not a Good Fit: many partial row updates or deletions over time: When you need to partially
revise or delete fields for rows that you read together. Also, when you revise or delete rows
within clustered reads.
{quote}

  was:
For immutable and mostly-immutable data the current SizeTiered-based compaction policy is not efficient:

# There is no need to compact all files into one: because the data is (mostly) immutable, there is no garbage to collect (the performance reasons will be discussed later).
# Size-tiered compaction is not suitable for applications where the most recent data is the most important, and it prevents efficient caching of that data.

The idea of a generational compaction policy is pretty similar to DateTieredCompaction in Cassandra:

# Memstore flushes create files of Gen0.
# Only store files of the same generation can be compacted.
# Once the number of files in GenK reaches N (default: 5), they are compacted into one file of Gen(K+1); see the sketch after this list.
# Compaction stops at a predefined generation M (default: 3).
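
A minimal sketch of that selection rule, assuming hypothetical names (GenerationalPolicy, GenFile, selectCompaction are illustrative and not part of the actual HBase compaction API):

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of the generational selection rule described above;
// all names are illustrative, not the real HBase compaction API.
public class GenerationalPolicy {

  // A store file tagged with the generation that produced it (0 = memstore flush).
  public static class GenFile {
    final String path;
    final int generation;
    GenFile(String path, int generation) {
      this.path = path;
      this.generation = generation;
    }
  }

  private final int filesPerGen; // N: file count that triggers a compaction (default 5)
  private final int maxGen;      // M: generation at which compaction stops (default 3)

  public GenerationalPolicy(int filesPerGen, int maxGen) {
    this.filesPerGen = filesPerGen;
    this.maxGen = maxGen;
  }

  // Select the next compaction: the lowest generation k < M that has
  // accumulated N files. Only files of the same generation are compacted
  // together, and the output becomes a single Gen(k+1) file.
  public List<GenFile> selectCompaction(List<GenFile> files) {
    Map<Integer, List<GenFile>> byGen = new TreeMap<>();
    for (GenFile f : files) {
      byGen.computeIfAbsent(f.generation, g -> new ArrayList<>()).add(f);
    }
    for (Map.Entry<Integer, List<GenFile>> e : byGen.entrySet()) {
      if (e.getKey() < maxGen && e.getValue().size() >= filesPerGen) {
        return e.getValue(); // compact these into one Gen(k+1) file
      }
    }
    return new ArrayList<>(); // nothing to compact yet
  }
}
{code}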

Simple math: for the sake of simplicity, let us say that the flush size is 30MB.

Gen0: 4*30MB = 120MB
Gen1: 4*120MB = 480MB
Gen2: 4*480MB = 1.92GB
Gen3: R * 1.92GB (Gen3 is not compacted by default)

With 3-4 files in Gen3 we get a total region size of 10-12GB, of which 10-20% (Gen0, Gen1, and most of Gen2) can be kept in the block cache.
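
The geometric growth is easy to verify; a throwaway snippet (names are illustrative) that reproduces the numbers above, assuming 4 files are merged per compaction as in this example:

{code:java}
// Throwaway check of the arithmetic above: a GenK file is 4^K * flushSize,
// i.e. each compaction merges 4 files into one file of the next generation.
public class GenSizes {
  public static void main(String[] args) {
    double fileMb = 30; // Gen0 file = one 30MB memstore flush
    for (int gen = 0; gen <= 3; gen++) {
      System.out.printf("Gen%d file size: %.0f MB%n", gen, fileMb);
      fileMb *= 4;      // 4 GenK files -> one Gen(K+1) file
    }
    // Gen0 file size: 30 MB
    // Gen1 file size: 120 MB
    // Gen2 file size: 480 MB
    // Gen3 file size: 1920 MB (~1.92GB, matching the figures above)
  }
}
{code}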

Generational compaction does not limit region size; one can use 100GB or even more, because total compaction IO per region can be limited and, generally speaking, does not depend explicitly on region size (as it does in the size-tiered compaction policy).

Now, about performance implications:

SSD-based servers will benefit from this policy because they provide more than adequate random IO, but even HDD-based systems can use it. Again, simple math: with a region size of ~10GB we will have ~16 files, of which 10-12 can be cached in the block cache. Even if a request touches all the files (spans the whole time range), it will need to read only 4-6 files from disk. How to keep the most recent data in the block cache at all times is a totally separate topic (JIRA).
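
The same back-of-the-envelope arithmetic for file counts and cacheable bytes, again as an illustrative snippet rather than real HBase code:

{code:java}
// Back-of-the-envelope check for a ~10GB region: with up to 4 files per
// generation and 4 uncompacted Gen3 files, the region holds ~16 files, and
// everything below Gen3 (~2.5GB) fits comfortably in the block cache.
public class CacheEstimate {
  public static void main(String[] args) {
    int filesPerGen = 4;                             // Gen0-Gen2 each hold up to 4 files
    int gen3Files = 4;                               // R: uncompacted 1.92GB Gen3 files
    int totalFiles = 3 * filesPerGen + gen3Files;    // = 16
    double cacheableMb = 4 * 30 + 4 * 120 + 4 * 480; // Gen0+Gen1+Gen2 = 2520 MB
    System.out.printf("files: %d, cacheable: %d files (~%.1f GB)%n",
        totalFiles, 3 * filesPerGen, cacheableMb / 1024);
    // A full-time-range read then touches only the 4-6 uncached Gen3 files.
  }
}
{code}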

 


> Compaction improvements: Date tiered compaction policy
> ------------------------------------------------------
>
>                 Key: HBASE-14477
>                 URL: https://issues.apache.org/jira/browse/HBASE-14477
>             Project: HBase
>          Issue Type: New Feature
>            Reporter: Vladimir Rodionov
>            Assignee: Vladimir Rodionov
>             Fix For: 2.0.0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
