hbase-issues mailing list archives

From "Yi Liang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-16894) Create more than 1 split per region, generalize HBASE-12590
Date Fri, 02 Dec 2016 17:26:59 GMT

    [ https://issues.apache.org/jira/browse/HBASE-16894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15715737#comment-15715737 ]

Yi Liang commented on HBASE-16894:

After discussing with Enis, we have some ideas for this JIRA:
 (1) Given the requirements in HBASE-16894, generalize the logic for creating more than 1 mapper
/ reducer per region. This will use the RegionSplitter to calculate split points within
a region when needed. It will assume a uniform key distribution and will calculate split points
from just the region's start / end keys, not from the actual data.
 (2) Find a way to calculate split points from existing data.

Note that (1) does not depend on (2), and (1) will give us a lot of benefit even without
(2). So I suggest we concentrate on the patch for (1) in HBASE-16894.
Again, we won't use any statistics or ask the regionservers, etc. for this. We will use just
the region start / end key boundaries and assume a uniform distribution.
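
To make idea (1) concrete, here is a minimal sketch of computing evenly spaced split points from only a region's start and end keys. It is written in Python for brevity and is not the RegionSplitter API; the function name `uniform_split_points` and the `width` parameter are hypothetical, and treating keys as fixed-width big-endian integers is an assumption about what "uniform distribution" means here.

```python
from typing import List


def uniform_split_points(start: bytes, end: bytes, num_splits: int, width: int = 8) -> List[bytes]:
    """Return num_splits - 1 interior keys evenly spaced between start and end.

    Hypothetical helper: keys are right-padded with zero bytes to a common
    width and interpolated as big-endian integers. No data is inspected.
    """
    if num_splits < 1:
        raise ValueError("num_splits must be >= 1")
    width = max(width, len(start), len(end))
    # An empty start key (first region) pads to all zero bytes.
    lo = int.from_bytes(start.ljust(width, b"\x00"), "big")
    # An empty end key (last region) is treated as one past the largest key.
    if end:
        hi = int.from_bytes(end.ljust(width, b"\x00"), "big")
    else:
        hi = 1 << (8 * width)
    if hi <= lo:
        raise ValueError("end key must sort after start key")
    points = []
    for i in range(1, num_splits):
        value = lo + (hi - lo) * i // num_splits
        points.append(value.to_bytes(width, "big"))
    return points
```

For a very narrow key range the computed points can repeat, so a real implementation would probably want to drop duplicates and fall back to fewer splits.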

After doing (1), we can create a follow-up JIRA for (2). There is already some recent work
in HBASE-16169, which makes the client ask every regionserver for the
region size calculation. I think we can piggy-back on this and have the regionserver also return
some split points based on the actual indexes, as part of the RegionLoad object.
The Region itself can maintain a small list of guideposts and make it available in the RegionLoad
object when asked. This way we won't have to maintain this info externally in a table or something,
but it will still be available on demand.

Again, I think we should do the patch for (1) first, then only after that focus on solving

> Create more than 1 split per region, generalize HBASE-12590
> -----------------------------------------------------------
>                 Key: HBASE-16894
>                 URL: https://issues.apache.org/jira/browse/HBASE-16894
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Enis Soztutar
>            Assignee: Yi Liang
>              Labels: beginner, beginners
> A common request from users is to be able to better control how many map tasks are created
> per region. Right now, it is always 1 region = 1 input split = 1 map task. The same goes for Spark
> since it uses the TIF. With region sizes as large as 50 GB, it is desirable to be able to
> create more than 1 split per region.
> HBASE-12590 adds a config property for MR jobs to be able to handle skew in region sizes.
> The algorithm is roughly:
> {code}
> If (region size >= average size * ratio) : cut the region into two MR input splits
> If (average size <= region size < average size * ratio) : one region as one MR input split
> If (sum of several continuous regions size < average size * ratio) : combine these regions into one MR input split
> {code}
> Although we can set the data skew ratio to 0.5 or so to abuse HBASE-12590 into
> creating more than 1 split per region, it is not ideal, and there is no way to create
> more than that with the patch as it is. For example, we cannot create more than 2 tasks per region.
> If we want to fix this properly, we should extend the approach in HBASE-12590 and make
> it so that the client can specify the desired number of mappers, or the desired split size, and the
> TIF generates the splits based on the current region sizes, very similar to the algorithm in
> HBASE-12590 but in a more generic way. This also would eliminate the hand tuning of data skew
> We can also think about the guidepost approach that Phoenix has in the stats table, which
> is used for exactly this purpose. Right now, the region can be split into powers of two assuming
> uniform distribution within the region.
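
As an illustration of the "desired split size" idea in the description above, the sketch below derives a per-region split count from region sizes alone. It is Python for brevity and is only one possible reading, not the TIF implementation or the HBASE-12590 code; the function name `splits_per_region` and its parameters are hypothetical, and sizes are assumed to be integers in a common unit such as MB.

```python
from typing import List


def splits_per_region(region_sizes: List[int], desired_split_size: int) -> List[int]:
    """Return how many input splits each region would get.

    Hypothetical helper: every region gets at least one split, and larger
    regions get ceil(size / desired_split_size) splits.
    """
    if desired_split_size <= 0:
        raise ValueError("desired_split_size must be positive")
    counts = []
    for size in region_sizes:
        # Integer ceiling division, avoiding floating point rounding.
        needed = -(-size // desired_split_size)
        counts.append(max(1, needed))
    return counts
```

With a desired split size of 10 GB, a 50 GB region would get 5 splits while a 4 GB region keeps its single split. Combining several small adjacent regions into one split, as HBASE-12590 does, is left out of this sketch.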

This message was sent by Atlassian JIRA
