hadoop-yarn-issues mailing list archives

From "Daniel Templeton (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (YARN-4958) The file localization process should allow for wildcards to reduce the application footprint in the state store
Date Thu, 21 Apr 2016 13:50:25 GMT

     [ https://issues.apache.org/jira/browse/YARN-4958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Templeton updated YARN-4958:
-----------------------------------
    Description: 
When using the -libjars option to add classes to the classpath, every library so added is
explicitly listed in the {{ContainerLaunchContext}}'s local resources even though they're
all uploaded to the same directory in HDFS.  When using tools like Crunch without an uber
JAR or when trying to take advantage of the shared cache, the number of libraries can be quite
large.  We've seen many cases where we had to turn down the max number of applications to
prevent ZK from running out of heap because of the size of the state store entries.

Rather than listing all files independently, this JIRA proposes to have the NM allow wildcards
in the resource localization paths.  Specifically, we propose to allow a path to have a final
component (name) set to "*", which is interpreted by the NM as "download the full directory
and link to every file in it from the job's working directory."  This behavior is the same
as the current behavior when using -libjars, but avoids explicitly listing every file.

This JIRA does not attempt to provide more general purpose wildcards, such as "*.jar" or "file*",
as having multiple entries for a single directory presents numerous logistical issues.
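
As a rough illustration of the rule above (this is not taken from the attached patch), the check for a supported wildcard could look like the sketch below. The class name {{WildcardPaths}} and the method {{isDirectoryWildcard}} are hypothetical names made up for this example; the only assumption is that a path qualifies when its final component is exactly "*".

```java
public class WildcardPaths {
    /** Hypothetical helper: true only when the final path component is exactly "*". */
    public static boolean isDirectoryWildcard(String path) {
        // Everything after the last '/' is the final component (the name).
        String name = path.substring(path.lastIndexOf('/') + 1);
        return name.equals("*");
    }

    public static void main(String[] args) {
        // Only the first path uses the whole-directory form described above.
        String[] samples = {"libjars/*", "libjars/*.jar", "libjars/file*", "libjars/a.jar"};
        for (String p : samples) {
            System.out.println(p + " -> " + isDirectoryWildcard(p));
        }
    }
}
```

Patterns such as "*.jar" or "file*" deliberately fail this check, which matches the scope stated above.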

This JIRA also does not attempt to integrate with the shared cache.  That work will be left
to a future JIRA.  Specifically, this JIRA only applies when a full directory is uploaded.
Currently the shared cache does not handle directory uploads.

This JIRA proposes to allow for wildcards both in the internal processing of the -libjars
switch and in paths added through the {{Job}} and {{DistributedCache}} classes.

The proposed approach is to treat a path, "dir/*", as "dir" for purposes of all file verification
and localization.  In the final step, the NM will query the localized directory to get a list
of the files in "dir" such that each can be linked from the job's working directory.  Since
$PWD/* is always included on the classpath, all JAR files in "dir" will be in the classpath.
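
The approach described above can be sketched with plain java.nio calls. This is only an illustration under stated assumptions, not the NM code or the contents of the attached patch: {{WildcardLinker}}, {{stripWildcard}} and {{linkAll}} are hypothetical names, and local temporary directories stand in for the localized directory and the job's working directory.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class WildcardLinker {
    /** Hypothetical helper: treats "dir/*" as "dir"; other paths are returned unchanged. */
    public static String stripWildcard(String path) {
        return path.endsWith("/*") ? path.substring(0, path.length() - 2) : path;
    }

    /** Hypothetical helper: links every entry of localizedDir into workDir. */
    public static List<Path> linkAll(Path localizedDir, Path workDir) throws IOException {
        List<Path> links = new ArrayList<>();
        // Query the localized directory for its entries (the "final step").
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(localizedDir)) {
            for (Path entry : entries) {
                // One symlink per entry, using the entry's own name in the working directory.
                Path link = workDir.resolve(entry.getFileName());
                Files.createSymbolicLink(link, entry.toAbsolutePath());
                links.add(link);
            }
        }
        return links;
    }

    public static void main(String[] args) throws IOException {
        // Temporary directories stand in for the localized "dir" and the working directory.
        Path dir = Files.createTempDirectory("libjars");
        Path work = Files.createTempDirectory("work");
        Files.createFile(dir.resolve("a.jar"));
        Files.createFile(dir.resolve("b.jar"));
        System.out.println(stripWildcard("dir/*"));
        System.out.println(linkAll(dir, work));
    }
}
```

Because each file is linked under its own name in the working directory, a classpath entry of $PWD/* would pick up the JARs without each one being listed individually.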

  was:
When using the -libjars option to add classes to the classpath, every library so added is
explicitly listed in the {{ContainerLaunchContext}}'s local resources even though they're
all uploaded to the same directory in HDFS.  When using tools like Crunch without an uber
JAR or when trying to take advantage of the shared cache, the number of libraries can be quite
large.  We've seen many cases where we had to turn down the max number of applications to
prevent ZK from running out of heap because of the size of the state store entries.

Rather than listing all files independently, this JIRA proposes to have the NM allow wildcards
in the resource localization paths.  Specifically, we propose to allow a path to have a final
component (name) set to "*", which is interpreted by the NM as "download the fell directory
and link to every file in it from the job's working directory."  This behavior is the same
as the current behavior when using -libjars, but avoids explicitly listing every file.

This JIRA does not attempt to provide more general purpose wildcards, such as "*.jar" or "file*",
as having multiple entries for a single directory presents numerous logistical issues.

This JIRA also does not attempt to integrate with the shared cache.  That work will be left
to a future JIRA.

This JIRA proposes to allow for wildcards both in the internal processing of the -libjars
switch and in paths added through the {{Job}} and {{DistributedCache}} classes.

The proposed approach is to treat a path, "dir/*", as "dir" for purposes of all file verification.
 In the final step, the NM will query the localized directory to get a list of the files in
"dir" such that each can be linked from the job's working directory.


> The file localization process should allow for wildcards to reduce the application footprint in the state store
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-4958
>                 URL: https://issues.apache.org/jira/browse/YARN-4958
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: nodemanager
>    Affects Versions: 2.8.0
>            Reporter: Daniel Templeton
>            Assignee: Daniel Templeton
>            Priority: Critical
>         Attachments: YARN-4958.001.patch
>
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
