Date: Fri, 8 Dec 2017 04:28:00 +0000 (UTC)
From: "Xiang Li (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Comment Edited] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat

[ https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16283035#comment-16283035 ]

Xiang Li edited comment on HBASE-15482 at 12/8/17 4:27 AM:
-----------------------------------------------------------

[~tedyu], thanks very much for your comments!
Patch 001 is uploaded to address your comments as well as the errors reported by checkstyle.
* "hbase.TableSnapshotInputFormat.locality" is renamed to "hbase.TableSnapshotInputFormat.locality.enable".
* The truncation of locations is moved into getBestLocations().
* The errors reported by checkstyle are corrected.

Regarding {{moving the truncation of locations into getBestLocations()}}:
The code has different logic for different combinations of hostAndWeights.length and numTopsAtMost. There is also a small behavior change in getBestLocations() when hostAndWeights.length is 0:
* Originally, it returned an empty list.
* After the change, it returns null.
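To make the behavior change concrete, here is a minimal sketch rather than the patch itself. `HostWeight` is a hypothetical stand-in for the host/weight pairs, the input is assumed to be sorted by descending weight, and clamping `numTopsAtMost` to at least 1 is an assumption; the real method may apply further filtering.

```java
import java.util.ArrayList;
import java.util.List;

final class BestLocationsSketch {
    /** Hypothetical stand-in for a host name and its block weight. */
    static final class HostWeight {
        final String host;
        final long weight;

        HostWeight(String host, long weight) {
            this.host = host;
            this.weight = weight;
        }
    }

    /**
     * Returns at most numTopsAtMost host names, or null when there are no hosts.
     * Assumes hostAndWeights is already sorted by descending weight.
     */
    static List<String> getBestLocations(HostWeight[] hostAndWeights, int numTopsAtMost) {
        if (hostAndWeights.length == 0) {
            return null; // no empty list is allocated; callers must handle null
        }
        // Assumption: a non-positive limit is treated as 1.
        int top = Math.min(Math.max(numTopsAtMost, 1), hostAndWeights.length);
        List<String> locations = new ArrayList<>(top);
        for (int i = 0; i < top; i++) {
            locations.add(hostAndWeights[i].host);
        }
        return locations;
    }

    /** A caller that treats null and an empty list the same way. */
    static String[] toLocationArray(List<String> locations) {
        if (locations == null || locations.isEmpty()) {
            return new String[0];
        }
        return locations.toArray(new String[0]);
    }

    public static void main(String[] args) {
        HostWeight[] hosts = {
            new HostWeight("host-a", 30), new HostWeight("host-b", 20), new HostWeight("host-c", 10)
        };
        System.out.println(getBestLocations(hosts, 2));
        System.out.println(toLocationArray(getBestLocations(new HostWeight[0], 2)).length);
    }
}
```

Because the caller normalizes both null and an empty list to an empty array, returning null for the empty case only saves an allocation and does not change what the split reports.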
I think we do not need to allocate an empty list here, because the locations are used to construct TableSnapshotInputFormatImpl.InputSplit, which checks for null as follows:
{code:title=hbase/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/TableSnapshotInputFormatImpl.java|borderStyle=solid}
public InputSplit(TableDescriptor htd, HRegionInfo regionInfo, List<String> locations,
    Scan scan, Path restoreDir) {
  this.htd = htd;
  this.regionInfo = regionInfo;
  if (locations == null || locations.isEmpty()) { // <--- here
    this.locations = new String[0];
  } else {
    this.locations = locations.toArray(new String[locations.size()]);
  }
  try {
    this.scan = scan != null ? TableMapReduceUtil.convertScanToString(scan) : "";
  } catch (IOException e) {
    LOG.warn("Failed to convert Scan to String", e);
  }
  this.restoreDir = restoreDir.toString();
}
{code}
Also, TableSnapshotInputFormatImpl is @InterfaceAudience.Private, and there are no other callers of getBestLocations() in the whole HBase project except unit tests. One unit test is updated to match the change above.

was (Author: water):
[~tedyu], thanks very much for your comments!
Patch 001 is updated to address your comments as well as the errors reported by checkstyle.
* "hbase.TableSnapshotInputFormat.locality" is renamed to "hbase.TableSnapshotInputFormat.locality.enable".
* The truncation of locations is moved into getBestLocations().
* The errors reported by checkstyle are corrected.

Regarding {{moving the truncation of locations into getBestLocations()}}:
The code has different logic for different combinations of hostAndWeights.length and numTopsAtMost. There is also a small behavior change in getBestLocations() when hostAndWeights.length is 0:
* Originally, it returned an empty list.
* After the change, it returns null.
I think we do not need to allocate an empty list here, because the locations are used to construct TableSnapshotInputFormatImpl.InputSplit, which checks for null as follows:
{code:title=hbase/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/TableSnapshotInputFormatImpl.java|borderStyle=solid}
public InputSplit(TableDescriptor htd, HRegionInfo regionInfo, List<String> locations,
    Scan scan, Path restoreDir) {
  this.htd = htd;
  this.regionInfo = regionInfo;
  if (locations == null || locations.isEmpty()) { // <--- here
    this.locations = new String[0];
  } else {
    this.locations = locations.toArray(new String[locations.size()]);
  }
  try {
    this.scan = scan != null ? TableMapReduceUtil.convertScanToString(scan) : "";
  } catch (IOException e) {
    LOG.warn("Failed to convert Scan to String", e);
  }
  this.restoreDir = restoreDir.toString();
}
{code}
Also, TableSnapshotInputFormatImpl is @InterfaceAudience.Private, and there are no other callers of getBestLocations() in the whole HBase project except unit tests. One unit test is updated to match the change above.

> Provide an option to skip calculating block locations for SnapshotInputFormat
> -----------------------------------------------------------------------------
>
>                 Key: HBASE-15482
>                 URL: https://issues.apache.org/jira/browse/HBASE-15482
>             Project: HBase
>          Issue Type: Improvement
>          Components: mapreduce
>            Reporter: Liyin Tang
>            Assignee: Xiang Li
>            Priority: Minor
>             Fix For: 2.1.0
>
>         Attachments: HBASE-15482.master.000.patch, HBASE-15482.master.001.patch
>
>
> When an MR job reads from SnapshotInputFormat, it needs to calculate the splits based on the block locations in order to get the best locality. However, this process may take a long time for large snapshots.
> In some setups, the computing layer (Spark, Hive, or Presto) could run outside the HBase cluster. In these scenarios, block locality doesn't matter. Therefore, it would be great to have an option to skip calculating the block locations for every job.
> That will be super useful for the Hive/Presto/Spark connectors.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
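As an illustration of the option requested in the issue, the following sketch shows how a split planner might consult such a flag before doing the expensive location lookup. It uses `java.util.Properties` as a stand-in for the Hadoop configuration object; the key is the one named in the comment, while the default of `true`, the `LocationCalculator` interface, and `locationsFor` are hypothetical names introduced only for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

final class LocalityOptionSketch {
    // Key as named in the comment; the default value is an assumption.
    static final String LOCALITY_ENABLE_KEY = "hbase.TableSnapshotInputFormat.locality.enable";
    static final boolean LOCALITY_ENABLE_DEFAULT = true;

    /** Hypothetical callback standing in for the expensive block-location computation. */
    interface LocationCalculator {
        List<String> calculate(String regionName);
    }

    /** Returns null without calling the calculator when locality is disabled. */
    static List<String> locationsFor(Properties conf, String regionName, LocationCalculator calc) {
        boolean enabled = Boolean.parseBoolean(
            conf.getProperty(LOCALITY_ENABLE_KEY, String.valueOf(LOCALITY_ENABLE_DEFAULT)));
        if (!enabled) {
            return null; // skip the lookup entirely; the split then reports no preferred hosts
        }
        return calc.calculate(regionName);
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        LocationCalculator calc = region -> Arrays.asList("host-a", "host-b");

        System.out.println(locationsFor(conf, "region-1", calc));
        conf.setProperty(LOCALITY_ENABLE_KEY, "false");
        System.out.println(locationsFor(conf, "region-1", calc));
    }
}
```

A job whose compute layer runs outside the HBase cluster would set the flag to `false` once in its configuration and avoid the per-region lookup for every split.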