hbase-dev mailing list archives

From "stack (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (HBASE-5138) [ref manual] Add a discussion on the number of regions
Date Sun, 20 Jan 2013 23:12:14 GMT

     [ https://issues.apache.org/jira/browse/HBASE-5138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack resolved HBASE-5138.
--------------------------

       Resolution: Fixed
    Fix Version/s: 0.96.0
     Hadoop Flags: Reviewed

Committed the text under the "number of regions" discussion that already exists in the ref guide's configuration section.
                
> [ref manual] Add a discussion on the number of regions
> ------------------------------------------------------
>
>                 Key: HBASE-5138
>                 URL: https://issues.apache.org/jira/browse/HBASE-5138
>             Project: HBase
>          Issue Type: Task
>            Reporter: Jean-Daniel Cryans
>            Priority: Critical
>             Fix For: 0.96.0
>
>
> ntelford on IRC made the good point that we say people shouldn't have too many regions, but we don't say why. His problem currently is:
> {quote}
> 09:21 < ntelford> problem is, if you're running MR jobs on a subset of that data, you need the regions to be as small as possible otherwise tasks don't get allocated in parallel much
> 09:22 < ntelford> so we've found we have to strike a balance between keeping them small for MR and keeping them large for HBase to behave well
> 09:22 < ntelford> we erred on the side of smaller regions because our MR issues were more immediate - we couldn't find any documentation or anecdotal evidence as to why HBase doesn't like lots of regions
> {quote}
> The three main issues I can think of with too many regions are:
>  - MSLAB requires 2MB per memstore (that's 2MB per family per region). 1000 regions with 2 families each means 3.9GB of heap used before a single row of data is stored. NB: the 2MB value is configurable. (The arithmetic is worked out in the sketch after this list.)
>  - if you fill all the regions at roughly the same rate, the global memstore limit forces tiny flushes once you have too many regions, which in turn generates compactions. Rewriting the same data tens of times is the last thing you want. As an example, fill 1000 regions (with one family each) equally and take a lower bound of 5GB for global memstore usage (the region server would have a big heap). Once usage reaches 5GB, the server force-flushes the biggest region; at that point nearly all regions hold about 5MB of data, so that is all it flushes. After another 5MB is inserted, it flushes another region that now holds a bit over 5MB, and so on.
>  - the new master is allergic to tons of regions, and will take a lot of time assigning them and moving them around in batches. The reason is that it's heavy on ZK usage, and it's not very async at the moment (could really be improved).
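
For concreteness, here is the arithmetic behind the first two points as a small back-of-the-envelope sketch. The 2MB MSLAB chunk size and the 5GB global memstore bound are the figures quoted above; the class name is made up for illustration, and nothing here calls into HBase itself:

{code}
// Back-of-the-envelope math for the two memory points above.
public class RegionCountMath {
  public static void main(String[] args) {
    long regions = 1000;
    long familiesPerRegion = 2;
    long mslabChunkBytes = 2L * 1024 * 1024;  // default chunk size, configurable

    // One MSLAB chunk is pre-allocated per memstore, i.e. per family per region.
    long mslabHeap = regions * familiesPerRegion * mslabChunkBytes;
    System.out.printf("MSLAB overhead: %.1f GB%n",
        mslabHeap / (1024.0 * 1024 * 1024));  // -> 3.9 GB, no data stored yet

    // With 1000 single-family regions filled evenly, the 5GB global
    // memstore bound is reached when each region holds roughly 5MB,
    // so every forced flush writes a tiny ~5MB file.
    long globalMemstoreBytes = 5L * 1024 * 1024 * 1024;
    long singleFamilyRegions = 1000;
    System.out.printf("Forced flush size: ~%d MB%n",
        globalMemstoreBytes / singleFamilyRegions / (1024 * 1024));
  }
}
{code}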
> Another issue is the effect of the number of regions on MapReduce jobs. Keeping 5 regions per RS would be too low for a job, whereas 1000 will generate too many maps. This comes back to ntelford's problem of needing to scan portions of tables. To solve his problem, we discussed using a custom input format that generates many splits per region (sketched below).
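
To make that last suggestion concrete, below is a minimal sketch of such an input format against the 0.94-era mapreduce API. The class name and the splits-per-region constant are hypothetical, empty start/stop keys (the first and last regions) are passed through unsplit for brevity, and real code would want that knob configurable:

{code}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.mapreduce.TableSplit;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;

// Hypothetical subclass that turns the usual one-split-per-region into
// several splits per region, so small scans still fan out across mappers.
public class MultiSplitTableInputFormat extends TableInputFormat {
  private static final int SPLITS_PER_REGION = 4;  // hypothetical tuning knob

  @Override
  public List<InputSplit> getSplits(JobContext context) throws IOException {
    List<InputSplit> splits = new ArrayList<InputSplit>();
    for (InputSplit s : super.getSplits(context)) {  // one split per region
      TableSplit ts = (TableSplit) s;
      byte[] start = ts.getStartRow();
      byte[] end = ts.getEndRow();
      // First/last region have empty boundary keys; keep them whole here.
      if (start.length == 0 || end.length == 0) {
        splits.add(ts);
        continue;
      }
      // Bytes.split returns the two endpoints plus the dividing keys,
      // or null when the range is too narrow to split.
      byte[][] keys = Bytes.split(start, end, SPLITS_PER_REGION - 1);
      if (keys == null) {
        splits.add(ts);
        continue;
      }
      for (int i = 0; i < keys.length - 1; i++) {
        splits.add(new TableSplit(ts.getTableName(), keys[i], keys[i + 1],
            ts.getRegionLocation()));
      }
    }
    return splits;
  }
}
{code}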

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
