Date: Fri, 30 Jun 2017 01:09:00 +0000 (UTC)
From: "Sailee Jain (JIRA)"
To: issues@hive.apache.org
Subject: [jira] [Updated] (HIVE-16999) Performance bottleneck in the add_resource api

[ https://issues.apache.org/jira/browse/HIVE-16999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sailee Jain updated HIVE-16999:
-------------------------------

Description:
A performance bottleneck was found when adding a resource that lives on HDFS to the distributed cache. The commands used are:

1. ADD ARCHIVE "{color:#d04437}hdfs{color}://some_dir/archive.tar"
2. ADD FILE "{color:#d04437}hdfs{color}://some_dir/file.txt"

Here is the log corresponding to the archive-adding operation:

=> converting to local hdfs://some_dir/archive.tar
=> Added resources: [hdfs://some_dir/archive.tar]

Hive is downloading the resource to the local filesystem (shown in the log by "converting to local"). Ideally there is no need to bring the file to the local filesystem when the operation amounts to copying a file from one HDFS location to another (the distributed cache). This becomes a significant bottleneck when the resource is a big file and every command needs the same resource.
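The localization described above hinges on the resource URI's scheme. A minimal, self-contained sketch of that dispatch (plain Java; `needsLocalCopy` is a hypothetical name, not a Hive method):

```java
import java.net.URI;

public class ResourceSchemeCheck {
    // Hypothetical helper mirroring the scheme dispatch described in this
    // report: "file" URIs are used as-is, "ivy" URIs go to the dependency
    // resolver, and everything else (including hdfs://) falls through to the
    // branch that downloads the resource to the local filesystem.
    static boolean needsLocalCopy(String value) {
        String scheme = URI.create(value).getScheme();
        if ("file".equals(scheme)) {
            return false;   // already local, no copy needed
        } else if ("ivy".equals(scheme)) {
            return false;   // handled by the dependency resolver instead
        } else {
            return true;    // hdfs:// and other remote schemes get downloaded
        }
    }

    public static void main(String[] args) {
        System.out.println(needsLocalCopy("hdfs://some_dir/archive.tar")); // true
        System.out.println(needsLocalCopy("file:///tmp/file.txt"));        // false
    }
}
```

The sketch only illustrates why an HDFS URI triggers the "converting to local" step seen in the log.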
After debugging, the impacted piece of code was found to be:

public List<String> add_resources(ResourceType t, Collection<String> values, boolean convertToUnix)
    throws RuntimeException {
  Set<String> resourceSet = resourceMaps.getResourceSet(t);
  Map<String, List<String>> resourcePathMap = resourceMaps.getResourcePathMap(t);
  Map<String, List<String>> reverseResourcePathMap = resourceMaps.getReverseResourcePathMap(t);
  List<String> localized = new ArrayList<String>();
  try {
    for (String value : values) {
      String key;
      {color:#d04437}// get the local path of downloaded jars.{color}
      List<URI> downloadedURLs = resolveAndDownload(t, value, convertToUnix);
      ...

bq. List<URI> {color:#d04437}resolveAndDownload{color}(ResourceType t, String value, boolean convertToUnix) throws URISyntaxException,
bq. IOException {
bq.   URI uri = createURI(value);
bq.   if (getURLType(value).equals("file")) {
bq.     return Arrays.asList(uri);
bq.   } else if (getURLType(value).equals("ivy")) {
bq.     return dependencyResolver.downloadDependencies(uri);
bq.   } else { {color:#d04437}// goes here for HDFS{color}
bq.     {color:#d04437}return Arrays.asList(createURI(downloadResource(value, convertToUnix)));{color}
bq.   }
bq. }

Thanks,
Sailee
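The HDFS branch quoted above could in principle skip the local round-trip by returning the original URI. A sketch of that idea (plain Java; `resolveForCache` is a hypothetical name, and this is not the actual HIVE-16999 patch):

```java
import java.net.URI;
import java.util.Collections;
import java.util.List;

public class HdfsShortCircuit {
    // Hypothetical variant of the resolveAndDownload() logic from the report:
    // when the resource already lives on HDFS, return its original URI so the
    // distributed-cache copy can happen HDFS-to-HDFS, instead of first
    // pulling the file down to the local filesystem.
    static List<URI> resolveForCache(String value) {
        URI uri = URI.create(value);
        String scheme = uri.getScheme();
        if ("hdfs".equals(scheme) || "file".equals(scheme)) {
            return Collections.singletonList(uri); // no local round-trip
        }
        // Other schemes (ivy, http, ...) would still go through the existing
        // download/resolve path; that part is omitted from this sketch.
        throw new UnsupportedOperationException("non-HDFS schemes omitted in sketch");
    }

    public static void main(String[] args) {
        System.out.println(resolveForCache("hdfs://some_dir/archive.tar"));
        // prints [hdfs://some_dir/archive.tar]
    }
}
```

A real fix would also need to handle the `convertToUnix` flag and the resource bookkeeping maps shown above.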
> Performance bottleneck in the add_resource api
> ----------------------------------------------
>
>                 Key: HIVE-16999
>                 URL: https://issues.apache.org/jira/browse/HIVE-16999
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>            Reporter: Sailee Jain
>            Priority: Critical
>
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)