hbase-issues mailing list archives

From "Sheetal Dolas (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-12324) Improve compaction speed and process for immutable short lived datasets
Date Fri, 24 Oct 2014 19:02:35 GMT

    [ https://issues.apache.org/jira/browse/HBASE-12324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14183287#comment-14183287 ]

Sheetal Dolas commented on HBASE-12324:

Sean, Vlad,

Thanks for your inputs.

[~vrodionov], in our case we already had all those params tuned; however, the expired data must
still get deleted. Which utility are you referring to? Can one run it while tables are active
and data is being ingested?
IMO adding external utilities is error prone and adds operational overhead, so it would be nice
if this were inside HBase. Also, as [~busbey] pointed out, tuning these parameters needs careful
evaluation and niche expertise.

It would be nice if HBase itself could take care of the complexities and make this easy for users/operators.
I can see multiple use cases, including OpenTSDB, which need this to be handled elegantly.

Let me add some more details to the use case and proposed solution.
Use case:
* Very high ingest rate.
* Immutable data
* Data life is short (few days)
* Read rates are low to moderate (in comparison to ingest rates)

Issues with default major compaction (even when compactions are done rarely)
* A lot of data IO just to get the expired data out
* No other significant benefits than expired data deletion

Proposed solution
* During major (or even minor) compactions, do not compact any data
* Just delete files whose timestamp is older than TTL
* Add a new compaction policy class, say "OnlyDeleteExpiredFilesCompactionPolicy", and set these
configurations while creating the table:
'hbase.hstore.defaultengine.compactionpolicy.class' => 'org.apache.hadoop.hbase.regionserver.compactions.OnlyDeleteExpiredFilesCompactionPolicy',
'hbase.store.delete.expired.storefile' => 'true'

Benefits
* Significant reduction in IO during compaction
* Automatically get rid of expired data
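
To make the proposed selection rule concrete, here is a minimal sketch in plain Java. It does not use any HBase classes and is not the attached patch: ExpiredFileSelector, FileInfo and selectExpired are hypothetical names, and it assumes that a file's "timestamp" means the newest cell timestamp stored in that file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, HBase-independent model of the proposed selection rule.
final class ExpiredFileSelector {

    // Stand-in for a store file: a name plus the newest cell timestamp it holds (ms).
    static final class FileInfo {
        final String name;
        final long maxTimestampMs;

        FileInfo(String name, long maxTimestampMs) {
            this.name = name;
            this.maxTimestampMs = maxTimestampMs;
        }
    }

    // Returns the names of files whose newest cell is already older than the TTL.
    // Such files hold only expired data, so they can be dropped without rewriting anything.
    static List<String> selectExpired(List<FileInfo> files, long ttlMs, long nowMs) {
        long cutoffMs = nowMs - ttlMs;
        List<String> expired = new ArrayList<>();
        for (FileInfo f : files) {
            if (f.maxTimestampMs < cutoffMs) {
                expired.add(f.name);
            }
        }
        return expired;
    }

    public static void main(String[] args) {
        long dayMs = 24L * 60 * 60 * 1000;
        long nowMs = 10 * dayMs;
        List<FileInfo> files = Arrays.asList(
            new FileInfo("file-a", 2 * dayMs),
            new FileInfo("file-b", 6 * dayMs),
            new FileInfo("file-c", 9 * dayMs));
        // With a TTL of 3 days the cutoff is day 7, so file-a and file-b qualify.
        System.out.println(selectExpired(files, 3 * dayMs, nowMs));
    }
}
```

The point of the sketch is that the decision needs only per-file metadata, so no cell data has to be read or rewritten to get rid of expired files.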

Assumptions and applicability
* TTL is defined at the table level or for all CFs in the table
* Cells use the system timestamp for versioning or, if the timestamp is overridden, the overriding
timestamp is close to the system timestamp
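
A small sketch of why the timestamp assumption matters, again in plain Java with hypothetical names (TimestampAssumptionDemo, fileExpired) and assuming a whole file is judged by its newest cell: a single cell written with a timestamp far ahead of the system clock keeps the entire file from ever qualifying for deletion.

```java
import java.util.Arrays;

// Hypothetical illustration; no HBase classes are used.
final class TimestampAssumptionDemo {

    // A whole file counts as expired only when its newest cell is past the TTL.
    // An empty file is treated as expired here, which is an arbitrary choice for the sketch.
    static boolean fileExpired(long[] cellTimestampsMs, long ttlMs, long nowMs) {
        long newest = Arrays.stream(cellTimestampsMs).max().orElse(Long.MIN_VALUE);
        return newest < nowMs - ttlMs;
    }

    public static void main(String[] args) {
        long dayMs = 24L * 60 * 60 * 1000;
        long nowMs = 10 * dayMs;
        long ttlMs = 3 * dayMs;

        // Cells stamped by the system clock on days 1 and 2.
        long[] systemStamped = {1 * dayMs, 2 * dayMs};
        // Same file plus one cell with a client-supplied timestamp of day 30.
        long[] withFutureStamp = {1 * dayMs, 2 * dayMs, 30 * dayMs};

        System.out.println(fileExpired(systemStamped, ttlMs, nowMs));
        System.out.println(fileExpired(withFutureStamp, ttlMs, nowMs));
    }
}
```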

I have attached the proposed compaction policy. It appears trivially simple. Thoughts?

> Improve compaction speed and process for immutable short lived datasets
> -----------------------------------------------------------------------
>                 Key: HBASE-12324
>                 URL: https://issues.apache.org/jira/browse/HBASE-12324
>             Project: HBase
>          Issue Type: New Feature
>          Components: Compaction
>    Affects Versions: 0.98.0, 0.96.0
>            Reporter: Sheetal Dolas
> We have seen multiple cases where HBase is used to store immutable data and the data
> lives for a short period of time (a few days).
> On very high volume systems, major compactions become very costly and slow down ingestion.
> In all such use cases (immutable data, high write rate, moderate read rates, and shorter
> TTL), avoiding any compactions and just deleting old data brings a lot of performance benefits.
> We should have a compaction policy that only deletes/archives files older than the TTL
> and does not compact any files.
> Also attaching a patch that can do so.
