hbase-issues mailing list archives

From "Jean-Daniel Cryans (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-2375) Make decision to split based on aggregate size of all StoreFiles and revisit related config params
Date Thu, 09 Feb 2012 18:10:00 GMT

    [ https://issues.apache.org/jira/browse/HBASE-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13204701 ]

Jean-Daniel Cryans commented on HBASE-2375:
-------------------------------------------

A bunch of things changed since this jira was created: 
 - we now split based on the store size 
 - regions split at 1GB 
 - memstores flush at 128MB (these two are shown as explicit config overrides in the sketch below) 
 - there's been a lot of work on tuning the store file selection algorithm 
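
For reference, a minimal sketch of the 1GB split size and 128MB flush size expressed as explicit
config overrides. The keys are the standard hbase.hregion.* ones; the values are just the numbers
quoted above:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    // Sketch only: set the split size and flush size mentioned above explicitly.
    public class SplitDefaults {
      public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();
        conf.setLong("hbase.hregion.max.filesize", 1024L * 1024 * 1024);       // regions split at 1GB
        conf.setLong("hbase.hregion.memstore.flush.size", 128L * 1024 * 1024); // memstores flush at 128MB
        System.out.println(conf.getLong("hbase.hregion.max.filesize", -1));
        System.out.println(conf.getLong("hbase.hregion.memstore.flush.size", -1));
      }
    }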

My understanding of this jira is that it aims at making the "out of the box mass import" experience
better. Now that we have bulk loads and pre-splitting, this use case is becoming less and less
important... although we still see people trying to benchmark it (hi hypertable). 
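
For completeness, pre-splitting at table creation time can look roughly like the sketch below. The
table name, family and split keys are made up for illustration; in practice you'd pick split keys
that match your own key distribution:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    // Sketch only: create a table pre-split into 4 regions instead of waiting for splits.
    public class PreSplitExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        HTableDescriptor desc = new HTableDescriptor("bulk_test");  // made-up table name
        desc.addFamily(new HColumnDescriptor("f"));                 // made-up family
        byte[][] splitKeys = new byte[][] {
          Bytes.toBytes("25"), Bytes.toBytes("50"), Bytes.toBytes("75")
        };
        admin.createTable(desc, splitKeys);  // 3 split keys -> 4 regions at creation time
      }
    }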

I see three things we could do:
 - Trigger splits after flushes; I hacked a patch and it works awesomely
 - Have a lower split size for newly created tables. Hypertable does this with a soft limit
that gets doubled every time the table splits until it reaches the normal split size (sketched
after this list)
 - Have multi-way splits (Todd's idea), so that if you have enough data that you know you're
going to be splitting after the current split, then just spawn as many daughters as you need.
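
To make the soft limit idea in the second bullet concrete, here's a rough sketch. The class and the
numbers are made up for illustration; none of this comes from Hypertable or from the patch attached
to this jira:

    // Hypothetical illustration of the "soft split limit that doubles on every split" idea.
    public class DoublingSplitLimit {
      private final long softLimit;  // e.g. 128MB for a brand new table
      private final long hardLimit;  // the normal, configured split size

      public DoublingSplitLimit(long softLimit, long hardLimit) {
        this.softLimit = softLimit;
        this.hardLimit = hardLimit;
      }

      /** Effective split size for a table that has already split splitCount times. */
      public long effectiveSplitSize(int splitCount) {
        long size = softLimit;
        for (int i = 0; i < splitCount && size < hardLimit; i++) {
          size *= 2;  // double the soft limit on every split
        }
        return Math.min(size, hardLimit);
      }

      public static void main(String[] args) {
        DoublingSplitLimit policy = new DoublingSplitLimit(128L << 20, 1L << 30); // 128MB soft, 1GB hard
        for (int splits = 0; splits <= 4; splits++) {
          System.out.println(splits + " splits -> split at " + (policy.effectiveSplitSize(splits) >> 20) + "MB");
        }
      }
    }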

 I'm planning on just fixing the first bullet point in the context of this jira. Maybe there's
other stuff from the patch in this jira that we could fit in.
                
> Make decision to split based on aggregate size of all StoreFiles and revisit related
config params
> --------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-2375
>                 URL: https://issues.apache.org/jira/browse/HBASE-2375
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver
>    Affects Versions: 0.20.3
>            Reporter: Jonathan Gray
>            Assignee: Jonathan Gray
>            Priority: Critical
>              Labels: moved_from_0_20_5
>         Attachments: HBASE-2375-v8.patch
>
>
> Currently we will make the decision to split a region when a single StoreFile in a single
family exceeds the maximum region size.  This issue is about changing the decision to split
to be based on the aggregate size of all StoreFiles in a single family (but still not aggregating
across families).  This would move a check to split after flushes rather than after compactions.
 This issue should also deal with revisiting our default values for some related configuration
parameters.
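
A minimal sketch of that per-family aggregate check, with made-up types and names standing in for
the real region/store internals (this is not the attached patch):

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch only: sum StoreFile sizes within each family and split if any single family's
    // aggregate reaches the configured maximum; sizes are never summed across families.
    public class AggregateSplitCheck {
      static boolean shouldSplit(Map<String, List<Long>> storeFileSizesByFamily, long maxSize) {
        for (List<Long> sizes : storeFileSizesByFamily.values()) {
          long aggregate = 0;
          for (long size : sizes) {
            aggregate += size;
          }
          if (aggregate >= maxSize) {
            return true;
          }
        }
        return false;
      }

      public static void main(String[] args) {
        long mb = 1024L * 1024;
        Map<String, List<Long>> stores = new HashMap<String, List<Long>>();
        stores.put("f1", Arrays.asList(64 * mb, 64 * mb, 64 * mb));
        System.out.println(shouldSplit(stores, 256 * mb));  // false: 192MB is under a 256MB maximum
        stores.put("f1", Arrays.asList(64 * mb, 64 * mb, 64 * mb, 64 * mb));
        System.out.println(shouldSplit(stores, 256 * mb));  // true: 256MB reaches the maximum
      }
    }
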
> The motivating factor for this change comes from watching the behavior of RegionServers
during heavy write scenarios.
> Today the default behavior goes like this:
> - We fill up regions, and as long as you are not under global RS heap pressure, you will
write out 64MB (hbase.hregion.memstore.flush.size) StoreFiles.
> - After we get 3 StoreFiles (hbase.hstore.compactionThreshold) we trigger a compaction
on this region.
> - Compaction queues notwithstanding, this will create a 192MB file, not triggering a
split based on max region size (hbase.hregion.max.filesize).
> - You'll then flush two more 64MB MemStores and hit the compactionThreshold and trigger
a compaction.
> - You end up with 192 + 64 + 64 in a single compaction.  This will create a single 320MB file
and will trigger a split.
> - While you are performing the compaction (which now writes out 64MB more than the split
size, so is about 5X slower than the time it takes to do a single flush), you are still taking
on additional writes into MemStore.
> - Compaction finishes, decision to split is made, region is closed.  The region now has
to flush whichever edits made it to MemStore while the compaction ran.  This flushing, in
our tests, is by far the dominating factor in how long data is unavailable during a split.
 We measured about 1 second to do the region closing, master assignment, reopening.  Flushing
could take 5-6 seconds, during which time the region is unavailable.
> - The daughter regions re-open on the same RS.  Immediately when the StoreFiles are opened,
a compaction is triggered across all of their StoreFiles because they contain references.
 Since we cannot currently split a split, we need to not hang on to these references for long.
> This described behavior is really bad because of how often we have to rewrite data onto
HDFS.  Imports are usually just IO bound as the RS waits to flush and compact.  In the above
example, the first cell to be inserted into this region ends up being written to HDFS 4 times
(initial flush, first compaction w/ no split decision, second compaction w/ split decision,
third compaction on daughter region).  In addition, we leave a large window where we take
on edits (during the second compaction of 320MB) and then must make the region unavailable
as we flush it.
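
Spelling the arithmetic out, a throwaway tally of the default flow described above (pure
arithmetic, no HBase APIs involved):

    // Sketch only: tally of the sizes in the default flow above (64MB flushes, compactionThreshold=3).
    public class DefaultFlowTally {
      public static void main(String[] args) {
        int flushMB = 64;
        int firstCompactionMB = 3 * flushMB;                       // 3 flushes compact into a 192MB file
        int secondCompactionMB = firstCompactionMB + 2 * flushMB;  // 192 + 64 + 64 -> a 320MB file
        System.out.println("first compaction:  " + firstCompactionMB + "MB");
        System.out.println("second compaction: " + secondCompactionMB + "MB, which triggers the split");
        // The earliest cells get written to HDFS 4 times: the initial flush, the first compaction,
        // the second compaction, and the post-split compaction on the daughter region.
        System.out.println("HDFS writes for the earliest cells: 4");
      }
    }
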
> If we increased the compactionThreshold to be 5 and determined splits based on aggregate
size, the behavior becomes:
> - We fill up regions, and as long as you are not under global RS heap pressure, you will
write out 64MB (hbase.hregion.memstore.flush.size) StoreFiles.
> - After each MemStore flush, we calculate the aggregate size of all StoreFiles.  We can
also check the compactionThreshold.  For the first three flushes, neither would hit the limit.
 On the fourth flush, we would see total aggregate size = 256MB and decide to split.
> - Decision to split is made, region is closed.  This time, the region just has to flush
out whichever edits made it to the MemStore during the snapshot/flush of the previous MemStore.
 So this time window has shrunk by more than 75%, as it is now the time to write 64MB from memory
rather than 320MB from aggregating 5 HDFS files.  This will greatly reduce the time data is unavailable
during splits.
> - The daughter regions re-open on the same RS.  Immediately when the StoreFiles are opened,
a compaction is triggered across all of their StoreFiles because they contain references.
 This would stay the same.
> In this example, we only write a given cell twice instead of 4 times (on the original flush,
and post-split to remove references), while drastically reducing data unavailability during
splits.  The other benefit of post-split compaction (which doesn't change) is that we
then get good data locality as the resulting StoreFile will be written to the local DataNode.
 In another jira, we should deal with opening up one of the daughter regions on a different
RS to distribute load better, but that's outside the scope of this one.
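
And the same kind of tally for the proposed flow (aggregate-size check after each 64MB flush,
256MB region max as in the example above):

    // Sketch only: tally of the proposed flow described above.
    public class ProposedFlowTally {
      public static void main(String[] args) {
        int flushMB = 64;
        int maxRegionMB = 256;
        int flushes = 0;
        int aggregateMB = 0;
        while (aggregateMB + flushMB <= maxRegionMB) {  // aggregate-size check after each flush
          flushes++;
          aggregateMB += flushMB;
        }
        System.out.println("split decided after " + flushes + " flushes at " + aggregateMB + "MB");
        // Each cell is written twice: the original flush, then the post-split compaction that
        // rewrites the references (instead of 4 times in the default flow).
        System.out.println("HDFS writes per cell: 2 (vs 4 before)");
      }
    }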

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
