hbase-issues mailing list archives

From "Weichen Ye (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job
Date Tue, 16 Dec 2014 12:23:13 GMT

     [ https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Weichen Ye updated HBASE-12590:
    Attachment: HBASE-12590-v3.patch

Hi, would you please take a look at this new patch?

In the new patch:
1, Redesigned the function that finds the split point in a large region.
2, Added a new mode for binary keys. The default mode is for text keys. Users can switch modes by setting
a new configuration: hbase.table.row.textkey
3, Added new tests for both text keys and binary keys.

> A solution for data skew in HBase-Mapreduce Job
> -----------------------------------------------
>                 Key: HBASE-12590
>                 URL: https://issues.apache.org/jira/browse/HBASE-12590
>             Project: HBase
>          Issue Type: Improvement
>          Components: mapreduce
>            Reporter: Weichen Ye
>         Attachments: A Solution for Data Skew in HBase-MapReduce Job (Version2).pdf,
HBASE-12590-v3.patch, HBase-12590-v1.patch, HBase-12590-v2.patch
> 1, Motivation
> In production environments, data skew is very common. An HBase table often contains
> many small regions and a few large regions. Small regions waste computing resources:
> scanning a table with 3000 small regions requires a job with 3000 mappers.
> Large regions stall the job: if one region in a 100-region table is far larger than
> the other 99, a job that takes the table as input finishes 99 mappers
> very quickly and then waits a long time for the last one.
> 2, Configuration
> Add two new configuration properties.
> hbase.mapreduce.split.autobalance = true enables “auto balance” in HBase-MapReduce
> jobs. The default value is false.
> hbase.mapreduce.split.targetsize = 1073741824 (default 1GB). The target size of a MapReduce split.
> If a region is larger than the target size, cut the region into two splits. If the
> combined size of several small contiguous regions is less than the target size, combine these regions
> into one split.
> Example: see the attachment.
> Reviews are welcome on the Review Board:
> https://reviews.apache.org/r/28494/diff/#
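
For readers without the attachment, the rule described under "2, Configuration" can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in HBASE-12590-v3.patch: the class SplitBalanceSketch, the Split holder, and the balance method are hypothetical names; region sizes are passed in as plain byte counts; a large region is cut into two equal halves by size, whereas the real patch has to pick a split row key; and a group of small regions is closed as soon as adding the next region would reach the target size. Only the two property names and the 1GB default come from the issue text.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; NOT the implementation from HBASE-12590-v3.patch.
class SplitBalanceSketch {
    // Property names and default value as quoted in the issue description.
    static final String AUTOBALANCE_KEY = "hbase.mapreduce.split.autobalance";
    static final String TARGET_SIZE_KEY = "hbase.mapreduce.split.targetsize";
    static final long DEFAULT_TARGET_SIZE = 1073741824L; // 1GB

    /** Hypothetical holder: one input split covering regions firstRegion..lastRegion. */
    static final class Split {
        final int firstRegion;
        final int lastRegion;
        final long bytes;

        Split(int firstRegion, int lastRegion, long bytes) {
            this.firstRegion = firstRegion;
            this.lastRegion = lastRegion;
            this.bytes = bytes;
        }

        @Override
        public String toString() {
            return "regions " + firstRegion + ".." + lastRegion + " (" + bytes + " bytes)";
        }
    }

    /** Turns per-region sizes (in table order) into balanced splits. */
    static List<Split> balance(long[] regionSizes, long targetSize) {
        List<Split> splits = new ArrayList<>();
        int i = 0;
        while (i < regionSizes.length) {
            long size = regionSizes[i];
            if (size > targetSize) {
                // Large region: cut it into two splits (assumed here: equal halves by size).
                long half = size / 2;
                splits.add(new Split(i, i, half));
                splits.add(new Split(i, i, size - half));
                i++;
                continue;
            }
            // Small region: absorb the following contiguous regions while the
            // combined size stays below the target size.
            int first = i;
            long total = size;
            i++;
            while (i < regionSizes.length && total + regionSizes[i] < targetSize) {
                total += regionSizes[i];
                i++;
            }
            splits.add(new Split(first, i - 1, total));
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long[] sizes = {100 * mb, 200 * mb, 300 * mb, 3000 * mb, 50 * mb};
        for (Split s : balance(sizes, DEFAULT_TARGET_SIZE)) {
            System.out.println(s);
        }
    }
}
```

With the five region sizes in main and the default target size, the first three small regions are combined into one split, the 3000MB region is cut into two splits, and the trailing 50MB region stays on its own, so the job would run four mappers instead of five.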

This message was sent by Atlassian JIRA
