hive-issues mailing list archives

From "Sailee Jain (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-16999) Performance bottleneck in the add_resource api
Date Fri, 30 Jun 2017 01:16:00 GMT

     [ https://issues.apache.org/jira/browse/HIVE-16999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Sailee Jain updated HIVE-16999:
-------------------------------
    Description: 
A performance bottleneck was found when adding a resource residing on HDFS to the distributed cache.

The commands used are:

{code:java}
1. ADD ARCHIVE "hdfs://some_dir/archive.tar"
2. ADD FILE "hdfs://some_dir/file.txt"
{code}

Here is the log corresponding to the ADD ARCHIVE operation:

{noformat}
 converting to local hdfs://some_dir/archive.tar
 Added resources: [hdfs://some_dir/archive.tar
{noformat}


Hive downloads the resource to the local filesystem, as shown in the log by "converting to local".

Ideally there is no need to bring the file to the local filesystem, since this operation amounts to copying the file from one HDFS location to another (the distributed cache).
The download becomes a significant bottleneck when the resource is a big file and every command needs the same resource.
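As a filesystem-agnostic illustration of that point, a copy within a single filesystem needs no intermediate local staging step. This is a minimal sketch using java.nio as a stand-in for Hadoop's FileSystem API; the class name and paths below are hypothetical, not Hive code:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class DirectCopySketch {
    // Copy src into dstDir directly; no intermediate local staging file.
    static Path directCopy(Path src, Path dstDir) throws IOException {
        Files.createDirectories(dstDir);
        return Files.copy(src, dstDir.resolve(src.getFileName()),
                StandardCopyOption.REPLACE_EXISTING);
    }

    // Demo: create a source file, copy it within the same filesystem,
    // and return the copied contents.
    static String demo() {
        try {
            Path tmp = Files.createTempDirectory("hdfs-sim");
            Path src = Files.writeString(tmp.resolve("archive.tar"), "payload");
            Path dst = directCopy(src, tmp.resolve("distributed-cache"));
            return Files.readString(dst);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "payload"
    }
}
```

In Hive itself, the equivalent would presumably be an HDFS-to-HDFS copy through Hadoop's FileSystem APIs rather than download-then-re-upload.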
After some debugging, the impacted piece of code was found to be:

{code:java}
public List<String> add_resources(ResourceType t, Collection<String> values, boolean convertToUnix)
      throws RuntimeException {
    Set<String> resourceSet = resourceMaps.getResourceSet(t);
    Map<String, Set<String>> resourcePathMap = resourceMaps.getResourcePathMap(t);
    Map<String, Set<String>> reverseResourcePathMap = resourceMaps.getReverseResourcePathMap(t);
    List<String> localized = new ArrayList<String>();
    try {
      for (String value : values) {
        String key;
        // get the local path of downloaded jars
        List<URI> downloadedURLs = resolveAndDownload(t, value, convertToUnix);
        // ...
{code}


{code:java}
  List<URI> resolveAndDownload(ResourceType t, String value, boolean convertToUnix)
      throws URISyntaxException, IOException {
    URI uri = createURI(value);
    if (getURLType(value).equals("file")) {
      return Arrays.asList(uri);
    } else if (getURLType(value).equals("ivy")) {
      return dependencyResolver.downloadDependencies(uri);
    } else { // goes here for HDFS
      // when the resource is not local, it is downloaded to the local machine
      return Arrays.asList(createURI(downloadResource(value, convertToUnix)));
    }
  }
{code}
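One possible direction for a fix, shown here only as a minimal sketch: let hdfs:// URIs pass through without localization, the way file:// URIs already do, leaving the HDFS-to-HDFS copy to the distributed-cache machinery. The helper name needsLocalDownload is hypothetical, and using the URI scheme as a stand-in for Hive's getURLType is an assumption:

```java
import java.net.URI;

public class ResolveSketch {
    // Hypothetical sketch, not the actual Hive patch: decide whether a
    // resource must be fetched to the local machine before registration.
    static boolean needsLocalDownload(String value) {
        String scheme = URI.create(value).getScheme();
        if (scheme == null || scheme.equals("file") || scheme.equals("hdfs")) {
            // already local, or usable on HDFS without localization
            return false;
        }
        // e.g. ivy:// dependencies still have to be fetched
        return true;
    }

    public static void main(String[] args) {
        System.out.println(needsLocalDownload("hdfs://some_dir/archive.tar")); // false
        System.out.println(needsLocalDownload("ivy://org.example:mod:1.0"));   // true
    }
}
```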





Thanks,
Sailee


> Performance bottleneck in the add_resource api
> ----------------------------------------------
>
>                 Key: HIVE-16999
>                 URL: https://issues.apache.org/jira/browse/HIVE-16999
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>            Reporter: Sailee Jain
>            Priority: Critical
>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
