Mailing-List: contact issues-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Date: Thu, 27 Nov 2014 02:05:12 +0000 (UTC)
From: "Weichen Ye (JIRA)" <jira@apache.org>
To: issues@hbase.apache.org
Message-ID: <JIRA.12758108.1417053514000.31767.1417053912469@Atlassian.JIRA>
In-Reply-To: <JIRA.12758108.1417053514000@Atlassian.JIRA>
References: <JIRA.12758108.1417053514000@Atlassian.JIRA>
 <JIRA.12758108.1417053514565@arcas>
Subject: [jira] [Updated] (HBASE-12590) A solution for data skew in
 HBase-Mapreduce Job
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable


     [ https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Weichen Ye updated HBASE-12590:
-------------------------------
    Attachment: A Solution for Data Skew in HBase-MapReduce Job.pdf

> A solution for data skew in HBase-Mapreduce Job
> ------------------------------------------------
>
>                 Key: HBASE-12590
>                 URL: https://issues.apache.org/jira/browse/HBASE-12590
>             Project: HBase
>          Issue Type: Improvement
>          Components: mapreduce
>    Affects Versions: 2.0.0
>            Reporter: Weichen Ye
>         Attachments: A Solution for Data Skew in HBase-MapReduce Job.pdf
>
>
> 1, Motivation
> In production environments, data skew is very common. An HBase table often contains many small regions and a few large ones. Small regions waste computing resources: scanning a table with 3000 small regions requires a job with 3000 mappers. Large regions stall the job: if one region in a 100-region table is far larger than the other 99, then 99 mappers finish quickly and the job waits a long time for the last one.
> 2, Configuration
> Add two new configuration properties.
> hbase.mapreduce.split.autobalance = true enables “auto balance” in HBase-MapReduce jobs. The default value is false.
> hbase.mapreduce.split.targetsize = 1073741824 (default 1 GB) sets the target size of MapReduce splits.
> If a region is larger than the target size, cut the region into two splits. If the combined size of several small contiguous regions is less than the target size, combine those regions into one split.
> Example:
> In attachment
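
The split rule described under "2, Configuration" can be sketched roughly as follows. This is an illustration only, not the code from the attached document or any patch: the function name balance_splits, the (name, size) tuples, the greedy left-to-right grouping, and the choice to halve an oversized region by size are all assumptions made for the example. Only the 1073741824 default comes from the description.

```python
from typing import List, Tuple

Region = Tuple[str, int]        # (region name, size in bytes); hypothetical shape
Split = Tuple[List[str], int]   # (regions or region halves covered, total bytes)

# Default of hbase.mapreduce.split.targetsize as given in the description (1 GB).
TARGET_SIZE = 1073741824


def balance_splits(regions: List[Region], target: int = TARGET_SIZE) -> List[Split]:
    """Greedy sketch: halve oversized regions, merge runs of small adjacent ones.

    `regions` must be listed in row-key order so that neighbours in the list
    are contiguous in the table.
    """
    splits: List[Split] = []
    group: List[str] = []   # small contiguous regions collected so far
    group_size = 0

    def flush() -> None:
        """Emit the pending group of small regions as one split."""
        nonlocal group, group_size
        if group:
            splits.append((group, group_size))
            group, group_size = [], 0

    for name, size in regions:
        if size > target:
            # Region larger than the target size: cut it into two splits.
            flush()
            half = size // 2
            splits.append(([name + "[first half]"], half))
            splits.append(([name + "[second half]"], size - half))
        else:
            # Keep combining while the sum stays below the target size.
            if group and group_size + size >= target:
                flush()
            group.append(name)
            group_size += size
    flush()
    return splits


if __name__ == "__main__":
    MB = 1024 * 1024
    table = [("r1", 100 * MB), ("r2", 200 * MB), ("r3", 3000 * MB),
             ("r4", 600 * MB), ("r5", 600 * MB)]
    for covered, size in balance_splits(table):
        print(covered, size // MB, "MB")
```

With this rule, a table of many tiny regions yields far fewer mappers than regions, while a single oversized region is shared by two mappers instead of one. How an oversized region is actually divided (for example, at which row key) is not stated in the description and is left open here.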


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)