Date: Fri, 30 Jun 2017 01:09:00 +0000 (UTC)
From: "Sailee Jain (JIRA)"
To: issues@hive.apache.org
Subject: [jira] [Updated] (HIVE-16999) Performance bottleneck in the add_resource api

[ https://issues.apache.org/jira/browse/HIVE-16999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sailee Jain updated HIVE-16999:
-------------------------------

Description:
A performance bottleneck was found when adding a resource that lives on HDFS to the distributed cache. The commands used are:

1. ADD ARCHIVE "{color:#d04437}hdfs{color}://some_dir/archive.tar"
2. ADD FILE "{color:#d04437}hdfs{color}://some_dir/file.txt"

Here is the log corresponding to the archive-adding operation:

=> converting to local hdfs://some_dir/archive.tar
=> Added resources: [hdfs://some_dir/archive.tar]

Hive is downloading the resource to the local filesystem (shown in the log by "converting to local"). Ideally there is no need to bring the file to the local filesystem when the operation amounts to copying a file from one HDFS location to another (the distributed cache). This becomes a significant bottleneck when the resource is a big file and every command needs the same resource.
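The localization described above hinges on the resource URI's scheme. A minimal, self-contained sketch of that dispatch (plain Java; `needsLocalCopy` is a hypothetical name, not a Hive method):

```java
import java.net.URI;

public class ResourceSchemeCheck {
    // Hypothetical helper mirroring the scheme dispatch described in this
    // report: "file" URIs are used as-is, "ivy" URIs go to the dependency
    // resolver, and everything else (including hdfs://) falls through to the
    // branch that downloads the resource to the local filesystem.
    static boolean needsLocalCopy(String value) {
        String scheme = URI.create(value).getScheme();
        if ("file".equals(scheme)) {
            return false;   // already local, no copy needed
        } else if ("ivy".equals(scheme)) {
            return false;   // handled by the dependency resolver instead
        } else {
            return true;    // hdfs:// and other remote schemes get downloaded
        }
    }

    public static void main(String[] args) {
        System.out.println(needsLocalCopy("hdfs://some_dir/archive.tar")); // true
        System.out.println(needsLocalCopy("file:///tmp/file.txt"));        // false
    }
}
```

The sketch only illustrates why an HDFS URI triggers the "converting to local" step seen in the log.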
After debugging, the impacted piece of code was found to be:

public List<String> add_resources(ResourceType t, Collection<String> values, boolean convertToUnix)
    throws RuntimeException {
  Set<String> resourceSet = resourceMaps.getResourceSet(t);
  Map<String, List<String>> resourcePathMap = resourceMaps.getResourcePathMap(t);
  Map<String, List<String>> reverseResourcePathMap = resourceMaps.getReverseResourcePathMap(t);
  List<String> localized = new ArrayList<String>();
  try {
    for (String value : values) {
      String key;
      {color:#d04437}// get the local path of downloaded jars.{color}
      List<URI> downloadedURLs = resolveAndDownload(t, value, convertToUnix);
      ...

bq. List<URI> {color:#d04437}resolveAndDownload{color}(ResourceType t, String value, boolean convertToUnix) throws URISyntaxException,
bq. IOException {
bq.   URI uri = createURI(value);
bq.   if (getURLType(value).equals("file")) {
bq.     return Arrays.asList(uri);
bq.   } else if (getURLType(value).equals("ivy")) {
bq.     return dependencyResolver.downloadDependencies(uri);
bq.   } else { {color:#d04437}// goes here for HDFS{color}
bq.     {color:#d04437}return Arrays.asList(createURI(downloadResource(value, convertToUnix)));{color}
bq.   }
bq. }

Thanks,
Sailee
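The HDFS branch quoted above could in principle skip the local round-trip by returning the original URI. A sketch of that idea (plain Java; `resolveForCache` is a hypothetical name, and this is not the actual HIVE-16999 patch):

```java
import java.net.URI;
import java.util.Collections;
import java.util.List;

public class HdfsShortCircuit {
    // Hypothetical variant of the resolveAndDownload() logic from the report:
    // when the resource already lives on HDFS, return its original URI so the
    // distributed-cache copy can happen HDFS-to-HDFS, instead of first
    // pulling the file down to the local filesystem.
    static List<URI> resolveForCache(String value) {
        URI uri = URI.create(value);
        String scheme = uri.getScheme();
        if ("hdfs".equals(scheme) || "file".equals(scheme)) {
            return Collections.singletonList(uri); // no local round-trip
        }
        // Other schemes (ivy, http, ...) would still go through the existing
        // download/resolve path; that part is omitted from this sketch.
        throw new UnsupportedOperationException("non-HDFS schemes omitted in sketch");
    }

    public static void main(String[] args) {
        System.out.println(resolveForCache("hdfs://some_dir/archive.tar"));
        // prints [hdfs://some_dir/archive.tar]
    }
}
```

A real fix would also need to handle the `convertToUnix` flag and the resource bookkeeping maps shown above.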
> Performance bottleneck in the add_resource api
> ----------------------------------------------
>
>                 Key: HIVE-16999
>                 URL: https://issues.apache.org/jira/browse/HIVE-16999
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>            Reporter: Sailee Jain
>            Priority: Critical
>
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)