Date: Wed, 4 Oct 2017 04:06:05 +0000 (UTC)
From: "Hudson (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Commented] (HBASE-16894) Create more than 1 split per region, generalize HBASE-12590

    [ https://issues.apache.org/jira/browse/HBASE-16894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16190767#comment-16190767 ]

Hudson commented on HBASE-16894:
--------------------------------

SUCCESS: Integrated in Jenkins build HBase-1.4 #940 (See [https://builds.apache.org/job/HBase-1.4/940/])
HBASE-16894 Create more than 1 split per region, generalize HBASE-12590 (apurtell: rev cbbcb2db2f0a94382cb33fef826cbf1a00b5de6e)
* (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/namespace/TestNamespaceAuditor.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/TableInputFormatBase.java
* (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/mapreduce/TestTableInputFormatScanBase.java
* (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/mapreduce/TestTableInputFormatScan1.java


> Create more than 1 split per region, generalize HBASE-12590
> -----------------------------------------------------------
>
>                 Key: HBASE-16894
>                 URL: https://issues.apache.org/jira/browse/HBASE-16894
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 3.0.0, 2.0.0-alpha-2
>            Reporter: Enis Soztutar
>            Assignee: Yi Liang
>             Fix For: 2.0.0, 3.0.0, 1.4.0, 1.5.0
>
>         Attachments: HBASE-16894.branch-1.patch, HBASE-16894.master.patch, HBASE-16894-V2-master.patch, HBASE-16894-V3-master.patch, ImplementaionAndSomeQuestion.docx
>
>
> A common request from users is to be able to better control how many map tasks are created per region. Right now, it is always 1 region = 1 input split = 1 map task. The same goes for Spark, since it uses the TableInputFormat (TIF). With region sizes as large as 50 GB, it is desirable to be able to create more than 1 split per region.
> HBASE-12590 adds a config property for MR jobs to be able to handle skew in region sizes. The algorithm is roughly:
> {code}
> If (region size >= average size * ratio): cut the region into two MR input splits
> If (average size <= region size < average size * ratio): one region as one MR input split
> If (sum of several contiguous regions' sizes < average size * ratio): combine these regions into one MR input split
> {code}
> Although we can set the data skew ratio to 0.5 or so to abuse HBASE-12590 into creating more than 1 split per region, it is not ideal, and the patch as it stands cannot go any further: for example, we cannot create more than 2 tasks per region.
> If we want to fix this properly, we should extend the approach in HBASE-12590 so that the client can specify the desired number of mappers, or the desired split size, and the TIF generates the splits based on the current region sizes, very similar to the algorithm in HBASE-12590 but in a more generic way. This would also eliminate the hand tuning of the data skew ratio.
> We can also think about the guidepost approach that Phoenix has in its stats table, which is used for exactly this purpose. Right now, the region can be split into powers of two, assuming uniform distribution within the region.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
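
For readers following the quoted description, here is a minimal Python sketch of the HBASE-12590 rule as the issue text states it, plus one possible reading of the proposed "desired split size" generalization. It is not the code in TableInputFormatBase or in any of the attached patches. The names `plan_splits` and `pieces_per_region`, the (region index, fraction) representation, and the choice to stop combining as soon as the running total would reach average size * ratio are all assumptions made for illustration.

```python
from typing import List, Tuple

# One input split is a list of (region_index, fraction_of_region) pairs.
# This representation is hypothetical; HBase uses TableSplit objects.
Split = List[Tuple[int, float]]


def plan_splits(region_sizes: List[int], ratio: float) -> List[Split]:
    """Sketch of the skew rule quoted from HBASE-12590 (assumed reading)."""
    if not region_sizes:
        return []
    average = sum(region_sizes) / len(region_sizes)
    threshold = average * ratio
    splits: List[Split] = []
    i = 0
    n = len(region_sizes)
    while i < n:
        size = region_sizes[i]
        if size >= threshold:
            # Large region: cut it into two input splits.
            splits.append([(i, 0.5)])
            splits.append([(i, 0.5)])
            i += 1
        elif size >= average:
            # Medium region: one region becomes one input split.
            splits.append([(i, 1.0)])
            i += 1
        else:
            # Small region: absorb the following contiguous regions while
            # the combined size stays below the threshold.
            group: Split = [(i, 1.0)]
            total = size
            i += 1
            while i < n and total + region_sizes[i] < threshold:
                group.append((i, 1.0))
                total += region_sizes[i]
                i += 1
            splits.append(group)
    return splits


def pieces_per_region(region_sizes: List[int], desired_split_size: int) -> List[int]:
    """Assumed generalization: ceil(size / desired_split_size) per region.

    Unlike plan_splits, this sketch does not combine small regions.
    """
    if desired_split_size <= 0:
        raise ValueError("desired_split_size must be positive")
    return [
        max(1, (size + desired_split_size - 1) // desired_split_size)
        for size in region_sizes
    ]
```

With two equal regions and a ratio of 0.5, every region meets the threshold and is cut in two, which is the "abuse" the description mentions; no ratio yields more than two splits per region under this rule. `pieces_per_region` has no such cap, because the count grows with the region size.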