hbase-issues mailing list archives

From "Elliott Clark (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-7842) Add compaction policy that explores more storefile groups
Date Thu, 14 Feb 2013 07:42:12 GMT

    [ https://issues.apache.org/jira/browse/HBASE-7842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13578203#comment-13578203 ]

Elliott Clark commented on HBASE-7842:

bq. What I don't understand is the first new condition for every file. E.g. (assume ratio 1), in order 10 7 4 5 is good to compact but 7 10 4 5 is not.

We're trying to use size as a way to group files that are similar.  If a use case has
traffic that comes in waves, we want to group the smaller files up to create larger files
before compacting the larger files.

For something like:

[100 100 100 50 50 50 50 50 100 100]

I'd rather have the 50s chosen first and compact the 100s later, since files of similar size
seem like a better grouping and a good way to avoid rewriting the same data over and over again.
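The order dependence raised in the quoted question can be checked with a small sketch. It assumes one reading of the per-file condition, namely that each file must be no larger than the ratio times the sum of the files that come after it in the list, with the last file exempt. That reading reproduces the quoted example but may not match the patch exactly; the function name is made up for illustration.

```python
from typing import Sequence


def in_ratio_ordered(sizes: Sequence[int], ratio: float = 1.0) -> bool:
    """Assumed reading: every file except the last must satisfy
    size <= ratio * (sum of the files listed after it)."""
    for i in range(len(sizes) - 1):
        if sizes[i] > ratio * sum(sizes[i + 1:]):
            return False
    return True


print(in_ratio_ordered([10, 7, 4, 5]))  # True: 10 <= 16, 7 <= 9, 4 <= 5
print(in_ratio_ordered([7, 10, 4, 5]))  # False: 10 > 4 + 5
```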

You are correct.  The main point that I want to dive into is that we are not looking at many
of the possible groupings right now, and it seems like a deep search would be cheap and could
find better compactions.

One of the things Stack and I thought about was not using the ratio at all for grouping files,
and instead using it to decide whether we found a compaction that's good enough.

Something like:
ratio = 1.2
files to compact = 5
sum of file size = 80
average store file size (across all files in the store) = 20

(80 / 20 ) / 5 < 1.2 so yes we compact.
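The arithmetic above can be written as a short sketch. The function name and the concrete file sizes are made up for illustration; they are chosen only so that the candidate set sums to 80 and the store-wide average comes out to 20.

```python
from typing import Sequence


def good_enough(candidate: Sequence[int], store: Sequence[int], ratio: float) -> bool:
    """Return True when the candidate's total size, measured in units of the
    store-wide average file size and divided by the number of candidate
    files, is below the ratio."""
    avg_store_size = sum(store) / len(store)
    return (sum(candidate) / avg_store_size) / len(candidate) < ratio


candidate = [16, 16, 16, 16, 16]   # 5 files, sum = 80
store = candidate + [40]           # 6 files, average = 20
print(good_enough(candidate, store, 1.2))  # (80 / 20) / 5 = 0.8 < 1.2
```

With these numbers the check passes; a candidate made of only the single 40-sized file would score 2.0 and be rejected.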

That's interesting and probably something that I'll try.  However, I wanted to start with something
that's a tweak of the algorithm that we have now, and then branch out.

bq. Back-of-the-napkin calculation tells me that dumb exploration of ALL ordered permutations should be fast

Yep.  The runtime of choosing files isn't something that should be a major concern as long
as things are not spiraling out of control.

> Add compaction policy that explores more storefile groups
> ---------------------------------------------------------
>                 Key: HBASE-7842
>                 URL: https://issues.apache.org/jira/browse/HBASE-7842
>             Project: HBase
>          Issue Type: New Feature
>          Components: Compaction
>    Affects Versions: 0.96.0
>            Reporter: Elliott Clark
>            Assignee: Elliott Clark
> Some workloads that are not as stable can have compactions that are too large or too small using the current storefile selection algorithm.
> Currently:
> * Find the first file where FileSize(fi) <= Sum(0, i-1, FileSize(fx))
> * Ensure that there are the min number of files (if there aren't, then bail out)
> * If there are too many files, keep the larger ones.
> I would propose something like:
> * Find all sets of storefiles where every file satisfies 
> ** FileSize(fi) <= Sum(0, i-1, FileSize(fx))
> ** Num files in set <= max
> ** Num Files in set >= min
> * Then pick the set of files that maximizes ((# storefiles in set) / Sum(FileSize(fx)))
> The thinking is that the above algorithm is pretty easy to reason about, all files satisfy the ratio, and it should rewrite the least amount of data to get the biggest impact in seeks.
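A minimal sketch of the proposed exploration, not the actual patch. It makes several assumptions that the issue text does not state: candidate sets are contiguous runs of the file list, the ratio condition compares each file (except the last) against the sum of the files after it, and ties on the score are broken in favour of more files. All names are hypothetical.

```python
from fractions import Fraction
from typing import Optional, Sequence, Tuple


def in_ratio(sizes: Sequence[int], ratio: float) -> bool:
    # Assumed reading of the per-file condition: each file but the last
    # must be no larger than ratio * (sum of the files after it).
    return all(sizes[i] <= ratio * sum(sizes[i + 1:])
               for i in range(len(sizes) - 1))


def explore(sizes: Sequence[int], min_files: int, max_files: int,
            ratio: float = 1.0) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the best contiguous window, or None.

    Score is (# files) / (total size), kept exact with Fraction;
    the tie-break on file count is an assumption added here."""
    best, best_key = None, None
    for start in range(len(sizes)):
        stop = min(start + max_files, len(sizes))
        for end in range(start + min_files, stop + 1):
            window = sizes[start:end]
            if not in_ratio(window, ratio):
                continue
            key = (Fraction(len(window), sum(window)), len(window))
            if best_key is None or key > best_key:
                best, best_key = (start, end), key
    return best


sizes = [100, 100, 100, 50, 50, 50, 50, 50, 100, 100]
print(explore(sizes, min_files=3, max_files=10))  # (3, 8): the five 50s
```

On the wave-shaped example from the comment, this picks the run of 50s and leaves the 100s for a later compaction. Without the tie-break, any three or four adjacent 50s would score the same as all five.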
