hbase-issues mailing list archives

From "Matt Corgan (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-5479) Postpone CompactionSelection to compaction execution time
Date Mon, 27 Feb 2012 19:18:46 GMT

    [ https://issues.apache.org/jira/browse/HBASE-5479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217410#comment-13217410 ]

Matt Corgan commented on HBASE-5479:

{quote}you need to do a bulk import MR (vs Put-based) or you have your compaction algorithm
tuned incorrectly... you probably want to switch your compaction ratio to 0.125 and play with
it from there{quote}
Yeah, I'm just using it as an opportunity to push HBase with real data to see what breaks first.
 I hesitate to change the global compaction ratio when it's just a couple out of ~20 tables.

Agree pluggable compaction strategies would be great, as would many other per-CF settings.
 Making them pluggable would be far more useful than perfecting a general algorithm.

Is there a quick fix that could deal with outdated requests?  One option is to ignore a CompactionRequest
if the files in its CompactionSelection are no longer all there.  Another is, when pulling a CompactionRequest
from the head of the queue, to iterate the entire queue and check whether there's a newer CompactionRequest
for the same Store.
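
To make the two quick fixes concrete, here is a minimal Java sketch. Everything in it is an assumption for illustration: `StaleRequestFilter`, `Request`, `filesMissing`, `superseded`, and `pollFresh` are hypothetical names, not HBase classes or methods, and store files are reduced to plain strings.

```java
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-ins for illustration; these are NOT the real HBase classes.
final class StaleRequestFilter {

    /** A queued request: the store it targets and the files picked at request time. */
    static final class Request {
        final String store;
        final Set<String> selectedFiles;

        Request(String store, List<String> selectedFiles) {
            this.store = store;
            this.selectedFiles = new HashSet<>(selectedFiles);
        }
    }

    /** Fix 1: a request is outdated if any file it selected is no longer in the store. */
    static boolean filesMissing(Request r, Map<String, Set<String>> liveFilesByStore) {
        Set<String> live = liveFilesByStore.getOrDefault(r.store, Collections.<String>emptySet());
        return !live.containsAll(r.selectedFiles);
    }

    /** Fix 2: a request is superseded if a later request for the same store is still queued. */
    static boolean superseded(Request head, Deque<Request> rest) {
        for (Request later : rest) {
            if (later.store.equals(head.store)) {
                return true;
            }
        }
        return false;
    }

    /** Pops requests until one passes both checks; returns null if the queue drains. */
    static Request pollFresh(Deque<Request> queue, Map<String, Set<String>> liveFilesByStore) {
        Request head;
        while ((head = queue.pollFirst()) != null) {
            if (!superseded(head, queue) && !filesMissing(head, liveFilesByStore)) {
                return head;
            }
        }
        return null;
    }
}
```

The scan in `superseded` is linear in the queue length, which is probably acceptable for a queue of a few dozen entries. It only drops stale work; it does not re-select files, which is what the issue itself proposes.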

> Postpone CompactionSelection to compaction execution time
> ---------------------------------------------------------
>                 Key: HBASE-5479
>                 URL: https://issues.apache.org/jira/browse/HBASE-5479
>             Project: HBase
>          Issue Type: New Feature
>          Components: io, performance, regionserver
>            Reporter: Matt Corgan
> It can be commonplace for regionservers to develop long compaction queues, meaning a
CompactionRequest may execute hours after it was created.  The CompactionRequest holds a CompactionSelection
that was selected at request time but may no longer be the optimal selection.  The CompactionSelection
should be created at compaction execution time rather than compaction request time.
> The current mechanism breaks down during high volume insertion.  The inefficiency is
clearest when the inserts are finished.  Inserting for 5 hours may build up 50 storefiles
and a 40 element compaction queue.  When finished inserting, you would prefer that the next
compaction merges all 50 files (or some large subset), but the current system will churn through
each of the 40 compaction requests, the first of which may be hours old.  This ends up re-compacting
the same data many times.  
> The current system is especially inefficient when dealing with time series data where
the data in the storefiles has minimal overlap.  With time series data, there is even less
benefit to intermediate merges because most storefiles can be eliminated based on their key
range during a read, even without bloomfilters.  The only goal should be to reduce file count,
not to minimize number of files merged for each read.
> There are other aspects to the current queuing mechanism that would need to be looked
at.  You would want to avoid having the same Store in the queue multiple times.  And you would
want the completion of one compaction to possibly queue another compaction request for the
> An alternative architecture to the current style of queues would be to have each Store
(all open in memory) keep a compactionPriority score up to date after events like flushes,
compactions, schema changes, etc.  Then you create a "CompactionPriorityComparator implements
Comparator<Store>" and stick all the Stores into a PriorityQueue (synchronized remove/add
from the queue when the value changes).  The async compaction threads would keep pulling off
the head of that queue as long as the head has compactionPriority > X.
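
A minimal Java sketch of the priority-queue architecture described above might look like the following. It is an illustration under stated assumptions, not HBase code: `CompactionScheduler`, `updatePriority`, `pollIfAbove`, and the nested `Store` are hypothetical, a higher `compactionPriority` is assumed to mean "more urgent", and the threshold plays the role of X.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Hypothetical sketch; Store here is a stand-in, not the HBase Store class.
final class CompactionScheduler {

    static final class Store {
        final String name;
        private int compactionPriority; // assumption: higher means more urgent

        Store(String name, int compactionPriority) {
            this.name = name;
            this.compactionPriority = compactionPriority;
        }

        int getCompactionPriority() {
            return compactionPriority;
        }
    }

    /** Orders stores so that the highest compactionPriority sits at the head of the queue. */
    static final class CompactionPriorityComparator implements Comparator<Store> {
        @Override
        public int compare(Store a, Store b) {
            return Integer.compare(b.getCompactionPriority(), a.getCompactionPriority());
        }
    }

    private final PriorityQueue<Store> queue =
            new PriorityQueue<>(new CompactionPriorityComparator());
    private final int threshold; // the "X" from the description

    CompactionScheduler(int threshold) {
        this.threshold = threshold;
    }

    /**
     * Called after flushes, compactions, schema changes, etc.
     * The store is removed BEFORE its priority changes, then re-added,
     * so the heap never holds an element whose ordering key has moved.
     */
    synchronized void updatePriority(Store store, int newPriority) {
        queue.remove(store);
        store.compactionPriority = newPriority;
        queue.add(store);
    }

    /** Returns the most urgent store if its priority exceeds the threshold, else null. */
    synchronized Store pollIfAbove() {
        Store head = queue.peek();
        if (head == null || head.getCompactionPriority() <= threshold) {
            return null;
        }
        return queue.poll();
    }
}
```

In this sketch a compaction thread would call `pollIfAbove()`, choose the files to merge only at that moment, and call `updatePriority` again once the compaction finishes, which puts the store back in the queue. Each store appears at most once, so the same Store cannot pile up multiple times.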
