hbase-issues mailing list archives

From "Weichen Ye (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job
Date Wed, 17 Dec 2014 06:02:13 GMT

     [ https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Weichen Ye updated HBASE-12590:
-------------------------------
    Description: 
1, Motivation
In production environments, data skew is very common. An HBase table may contain many
small regions and a few large regions. Small regions waste computing resources:
if a job scans a table with 3000 small regions, it needs 3000 mappers.
Large regions stall the job: if one region in a 100-region table is far larger than
the other 99, a job that takes the table as input finishes 99 mappers
very quickly and then waits a long time for the last mapper.

2, Configuration
Add three new configuration properties:
hbase.mapreduce.input.autobalance = true enables “auto balance” in HBase-MapReduce
jobs. The default value is false.
hbase.mapreduce.input.autobalance.maxskewratio = 3 (default is 3). If a region is larger
than 3x the average region size, it is treated as “proportionately too large”.
hbase.table.row.textkey = true means the row key is text; false means the row key is binary. It
is used to find the middle row key of a large region. The default value is true.
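
As a rough illustration of finding a middle row key for a region with binary keys, the sketch below averages the start and end keys as unsigned big-endian integers. The names MidKeySketch and midKey are hypothetical and are not taken from the patch. Text-key handling (hbase.table.row.textkey = true) and HBase's empty start/end keys for the first and last region are not covered here.

```java
import java.math.BigInteger;
import java.util.Arrays;

// Hypothetical helper, not the code from the HBASE-12590 patch.
final class MidKeySketch {

    // Returns a key roughly halfway between start and end, treating both
    // as unsigned big-endian numbers right-padded with zero bytes.
    static byte[] midKey(byte[] start, byte[] end) {
        int len = Math.max(start.length, end.length);
        // Arrays.copyOf pads the shorter key with trailing zero bytes.
        BigInteger lo = new BigInteger(1, Arrays.copyOf(start, len));
        BigInteger hi = new BigInteger(1, Arrays.copyOf(end, len));
        BigInteger mid = lo.add(hi).shiftRight(1);

        // toByteArray may add a leading sign byte or drop leading zeros,
        // so copy the low-order bytes into a fixed-length result.
        byte[] raw = mid.toByteArray();
        byte[] out = new byte[len];
        int copy = Math.min(raw.length, len);
        System.arraycopy(raw, raw.length - copy, out, len - copy, copy);
        return out;
    }
}
```

For single-byte keys 0x61 ("a") and 0x63 ("c"), this returns 0x62 ("b"). A text-key variant would presumably need to keep the result within printable characters; that is an assumption, not something stated in the description.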
If (region size >= average size * ratio): cut the region into two MR input splits.
If (average size <= region size < average size * ratio): use the region as one MR input split.
If (total size of several contiguous regions < average size): combine these regions into
one MR input split.
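
The three rules above can be sketched in plain Java as follows. This is a minimal illustration under stated assumptions, not the patch itself: the class AutoBalanceSketch, the method planSplits, and the string labels for splits are all hypothetical, and the exact point at which a run of small regions is closed (here: when adding the next region would reach the average) is a guess at the intended behavior.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical planner, not the code from the HBASE-12590 patch.
final class AutoBalanceSketch {

    // Returns one label per planned input split, in region order.
    static List<String> planSplits(String[] names, long[] sizes, double maxSkewRatio) {
        long total = 0;
        for (long s : sizes) total += s;
        double average = sizes.length == 0 ? 0 : (double) total / sizes.length;

        List<String> splits = new ArrayList<>();
        List<String> pending = new ArrayList<>(); // current run of small contiguous regions
        long pendingSize = 0;

        for (int i = 0; i < sizes.length; i++) {
            long size = sizes[i];
            if (size < average) {
                // Small region: close the run first if adding it would reach the average.
                if (!pending.isEmpty() && pendingSize + size >= average) {
                    splits.add(String.join("+", pending));
                    pending.clear();
                    pendingSize = 0;
                }
                pending.add(names[i]);
                pendingSize += size;
                continue;
            }
            // A normal or large region ends any run of small regions.
            if (!pending.isEmpty()) {
                splits.add(String.join("+", pending));
                pending.clear();
                pendingSize = 0;
            }
            if (size >= average * maxSkewRatio) {
                // Proportionately too large: cut into two splits.
                splits.add(names[i] + "[first half]");
                splits.add(names[i] + "[second half]");
            } else {
                splits.add(names[i]); // one region, one split
            }
        }
        if (!pending.isEmpty()) {
            splits.add(String.join("+", pending));
        }
        return splits;
    }
}
```

With six regions r1..r6 of sizes 10, 10, 10, 100, 400, 10 and a ratio of 3, the average is 90, so this sketch plans five splits: r1+r2+r3, r4, r5[first half], r5[second half], r6.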


Example:
See the attachment.

Reviews are welcome on Review Board:
https://reviews.apache.org/r/28494/diff/#



  was:
1, Motivation
In production environments, data skew is very common. An HBase table always contains
many small regions and a few large regions. Small regions waste computing resources:
if a job scans a table with 3000 small regions, it needs 3000 mappers.
Large regions stall the job: if one region in a 100-region table is far larger than
the other 99, a job that takes the table as input finishes 99 mappers
very quickly and then waits a long time for the last mapper.

2, Configuration
Add two new configuration properties:
hbase.mapreduce.split.autobalance = true enables “auto balance” in HBase-MapReduce
jobs. The default value is false.
hbase.mapreduce.split.targetsize = 1073741824 (default 1GB). The target size of MapReduce
splits.
If a region is larger than the target size, cut the region into two splits. If the total size of
several small contiguous regions is less than the target size, combine these regions into
one split.

Example:
See the attachment.

Reviews are welcome on Review Board:
https://reviews.apache.org/r/28494/diff/#




> A solution for data skew in HBase-Mapreduce Job
> -----------------------------------------------
>
>                 Key: HBASE-12590
>                 URL: https://issues.apache.org/jira/browse/HBASE-12590
>             Project: HBase
>          Issue Type: Improvement
>          Components: mapreduce
>            Reporter: Weichen Ye
>         Attachments: A Solution for Data Skew in HBase-MapReduce Job (Version2).pdf,
A Solution for Data Skew in HBase-MapReduce Job (Version3).pdf, HBASE-12590-v3.patch, HBase-12590-v1.patch,
HBase-12590-v2.patch



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
