hbase-issues mailing list archives

From "Vladimir Rodionov (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-14477) Compaction improvements: Date tiered compaction policy
Date Thu, 01 Oct 2015 21:36:26 GMT

    [ https://issues.apache.org/jira/browse/HBASE-14477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940458#comment-14940458 ]

Vladimir Rodionov commented on HBASE-14477:
-------------------------------------------

After internal discussion with peers, we agreed that users could be confused when configuring
Generational compaction, and that the similar DateTieredCompaction is a better alternative.
So I have renamed the JIRA.

> Compaction improvements: Date tiered compaction policy
> ------------------------------------------------------
>
>                 Key: HBASE-14477
>                 URL: https://issues.apache.org/jira/browse/HBASE-14477
>             Project: HBase
>          Issue Type: New Feature
>            Reporter: Vladimir Rodionov
>            Assignee: Vladimir Rodionov
>             Fix For: 2.0.0
>
>
> For immutable and mostly immutable data, the current SizeTiered-based compaction policy
> is not efficient.
> # There is no need to compact all files into one, because the data is (mostly) immutable
> and we do not need to collect garbage. (The performance reason will be discussed later.)
> # Size-tiered compaction is not suitable for applications where the most recent data is the
> most important, and it prevents efficient caching of this data.
> The idea is pretty similar to DateTieredCompaction in Cassandra:
> http://www.datastax.com/dev/blog/datetieredcompactionstrategy
> http://www.datastax.com/dev/blog/dtcs-notes-from-the-field
> From Cassandra's own blog:
> {quote}
> Since DTCS can be used with any table, it is important to know when it is a good idea,
> and when it is not. I’ll try to explain the spectrum and trade-offs here:
> 1. Perfect Fit: Time Series Fact Data, Deletes by Default TTL: When you ingest fact data
> that is ordered in time, with no deletes or overwrites. This is the standard “time series”
> use case.
> 2. OK Fit: Time-Ordered, with limited updates across whole data set, or only updates
> to recent data: When you ingest data that is (mostly) ordered in time, but revise or delete
> a very small proportion of the overall data across the whole timeline.
> 3. Not a Good Fit: many partial row updates or deletions over time: When you need to
> partially revise or delete fields for rows that you read together. Also, when you revise or
> delete rows within clustered reads.
> {quote}
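
To make the idea in the description above concrete, here is a minimal sketch of how a
date-tiered policy might group store files by age instead of by size. It is not the HBase
implementation or the policy proposed in this JIRA. The names StoreFile, tier_of,
select_compactions, base_window, growth, and min_files are all hypothetical, and the
exponentially growing tiers are an assumption borrowed loosely from the Cassandra DTCS
posts linked above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class StoreFile:
    """Hypothetical stand-in for a store file; not an HBase class."""
    name: str
    max_timestamp: int  # newest cell timestamp in the file, in seconds


def tier_of(age: int, base_window: int, growth: int) -> int:
    """Return the tier index for a file of the given age.

    Tier 0 covers ages [0, base_window). The upper age bound of each
    older tier is `growth` times the bound of the tier before it.
    """
    if base_window <= 0 or growth < 2:
        raise ValueError("base_window must be > 0 and growth must be >= 2")
    age = max(age, 0)  # treat files with future timestamps as brand new
    tier, upper = 0, base_window
    while age >= upper:
        tier += 1
        upper *= growth
    return tier


def select_compactions(files, now, base_window, growth, min_files):
    """Group files by age tier and keep tiers with at least min_files files.

    Files are only ever compacted with other files from the same tier,
    so recent data is never merged into one large, old file.
    """
    tiers = defaultdict(list)
    for f in files:
        tiers[tier_of(now - f.max_timestamp, base_window, growth)].append(f)
    return {t: fs for t, fs in tiers.items() if len(fs) >= min_files}


if __name__ == "__main__":
    files = [
        StoreFile("a", 995), StoreFile("b", 990),  # very recent
        StoreFile("c", 960), StoreFile("d", 950),  # somewhat older
        StoreFile("e", 100),                       # old and alone in its tier
    ]
    plan = select_compactions(files, now=1000, base_window=20, growth=4, min_files=2)
    for tier in sorted(plan):
        print(tier, [f.name for f in plan[tier]])
```

In this sketch the old file "e" is left alone because nothing else shares its tier, which
matches the point in the description that there is no need to compact all files into one
when the data is (mostly) immutable.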



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
