hadoop-yarn-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-2185) Use pipes when localizing archives
Date Mon, 08 Jan 2018 18:53:00 GMT

    [ https://issues.apache.org/jira/browse/YARN-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16316777#comment-16316777 ]

Jason Lowe commented on YARN-2185:

Thanks for the patch!

This patch adds parallel copying of directories and the ability to ignore timestamps, which IMHO
is outside the scope of this patch.  I think those should be handled in separate JIRAs with
appropriate unit tests.

Do we really want a while loop to pull off extensions?  That's not how it behaved before,
and it will now do different, potentially unexpected, things for a ".jar.gz", ".tgz.tgz",
".gz.jar", etc.  Granted, these are unlikely filenames to encounter, but if a user somehow
had such names working before, this will change that behavior.
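To make the concern concrete, here is a hypothetical sketch (not the patch's actual code) showing how looped stripping diverges from stripping a single known extension:

```java
public class ExtStrip {
  private static final String[] EXTS =
      {".tar.gz", ".tgz", ".tar", ".gz", ".jar", ".zip"};

  // Old behavior: strip at most one known archive extension.
  public static String stripOnce(String name) {
    for (String ext : EXTS) {
      if (name.endsWith(ext)) {
        return name.substring(0, name.length() - ext.length());
      }
    }
    return name;
  }

  // A while loop keeps stripping: "x.tgz.tgz" becomes "x", not "x.tgz".
  public static String stripAll(String name) {
    String prev;
    do {
      prev = name;
      name = stripOnce(name);
    } while (!name.equals(prev));
    return name;
  }
}
```

For "x.tgz.tgz" the one-shot version yields "x.tgz" while the loop yields "x", which is exactly the silent behavior change described above.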

Does it make sense to use an external process to unpack .jar and .zip files?  The original
code did this via the JDK directly rather than requiring a subprocess, and given there's a
ZipInputStream and JarInputStream, it seems we could do the same here with the input stream
rather than requiring the overhead of a subprocess and copying of data between streams.
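A minimal sketch of the stream-based approach for zip/jar, using only the JDK and no subprocess (class and method names are illustrative, not the patch's API):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class StreamUnzip {
  // Unpack a zip/jar archive directly from an InputStream into destDir.
  // JarInputStream extends ZipInputStream, so the same loop covers both.
  public static void unzip(InputStream in, Path destDir) throws IOException {
    try (ZipInputStream zin = new ZipInputStream(in)) {
      ZipEntry entry;
      while ((entry = zin.getNextEntry()) != null) {
        Path target = destDir.resolve(entry.getName()).normalize();
        if (!target.startsWith(destDir)) {
          throw new IOException("Blocked path traversal: " + entry.getName());
        }
        if (entry.isDirectory()) {
          Files.createDirectories(target);
        } else {
          Files.createDirectories(target.getParent());
          // Files.copy reads until the end of the current entry only.
          Files.copy(zin, target, StandardCopyOption.REPLACE_EXISTING);
        }
        zin.closeEntry();
      }
    }
  }
}
```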

This now logs an INFO message for all stderr and stdout from the subprocess (i.e.: at least
one line per entry in the archive), which is not something it did before.  This should be
debug, if logged at all, and as written it requires buffering all of the output.  Some of our
users have some crazy use cases with very large/deep tarballs, and grabbing all the output
from tar would not be a trivial amount of memory for the NM or a localizer heap, especially
if there are multiple of these running at the same time.  If the log message is kept, the
verbose flag should be passed to the command only if the output will be logged.

I think it would be better to explicitly create the directory from the Java code rather than
squirrel it into the tar command.  We can provide a better error message with more context
if done directly, and as it is now the mkdir can fail but blindly proceed to the tar command.

Wouldn't "cd _destpath_ && tar -xf" be better than "cd _destpath_; tar -xf"?  Otherwise
if somehow the change directory fails it will end up unpacking to an unexpected place.
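Sketching both suggestions together (illustrative names and naive quoting, not the patch's actual command construction): create the directory from Java for a contextual error, and chain with {{&&}} so tar never runs if the cd fails:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class TarLauncher {
  // Build the unpack command after creating the destination ourselves.
  public static ProcessBuilder tarCommand(Path destPath) throws IOException {
    try {
      // Explicit creation lets us attach a meaningful error message.
      Files.createDirectories(destPath);
    } catch (IOException e) {
      throw new IOException("Cannot create localization dir " + destPath, e);
    }
    // && aborts the pipeline instead of unpacking in the wrong place
    // if the cd fails for any reason.
    String cmd = "cd '" + destPath + "' && tar -xf -";
    return new ProcessBuilder("bash", "-c", cmd);
  }
}
```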

Is the "rm -rf" necessary?

Has this been tested on Windows?  The zipOrJar path appears to be eligible for Windows but
uses shell syntax that {{cmd}} is not going to understand, like semicolons between commands.

Does the single-quote escaping work properly?  Escaping double quotes with a backslash works
within a double-quoted string, but the same is not true for single-quoted strings.  The shell
does not interpret any characters in a single-quoted string, even backslash characters.  Therefore
the only way I know of to put a single-quote character in a single-quoted string is to terminate
the current string, insert an escaped single quote (since we're not in a single-quote parsing
context at this point), then start the single-quote string up again.  In other words, each
{{'}} character trying to be escaped within a single-quote context becomes the four character
sequence: {{'\''}}  For example, single-quoting {{abc'def'ghi}} becomes {{'abc'\''def'\''ghi'}}.
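The escaping rule above can be captured in a tiny helper (a hypothetical name, shown only to pin down the expected output):

```java
public class ShellQuote {
  // Wrap s in single quotes for a POSIX shell. Each embedded ' becomes
  // the four-character sequence '\'' : close the quoted string, emit a
  // backslash-escaped quote, then reopen the quoted string.
  public static String singleQuote(String s) {
    return "'" + s.replace("'", "'\\''") + "'";
  }
}
```

For example, {{singleQuote("abc'def'ghi")}} yields {{'abc'\''def'\''ghi'}}, matching the expansion described above.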

I think we should shut down the executor service if we created it; otherwise we risk running
out of threads before the garbage collector gets around to noticing it can free everything
up.  Either that or require the caller to provide the executor.  Note that if the caller provides
a single thread executor (or some other executor that can only execute serially) then this
can deadlock if the process emits a lot of output on stderr.  The thread trying to consume
stdout data is blocked until the child process completes, but the child process is blocked
trying to emit stderr output that the parent process is not consuming.  I wonder if it would
be better to add a method to Shell that takes an input stream so we could reuse all the subprocess
handling code there.
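A sketch of the drain-and-shutdown pattern being suggested (illustrative names; Shell's actual helpers would subsume this): stderr is consumed on its own thread so a chatty child cannot fill the pipe and deadlock, and the pool we created is shut down in a finally block:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class StreamDrainer {
  // Consume stdout on the calling thread while a pool thread drains
  // stderr, so neither pipe can fill up and block the child process.
  public static void runAndDrain(Process proc) throws Exception {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    try {
      InputStream err = proc.getErrorStream();
      pool.submit(() -> {
        try {
          while (err.read() != -1) {
            // discard (or log at debug level)
          }
        } catch (IOException ignored) {
        }
      });
      try (InputStream out = proc.getInputStream()) {
        while (out.read() != -1) {
          // consume unpacked data
        }
      }
      proc.waitFor();
    } finally {
      // Release the thread even if the caller forgets about the pool.
      pool.shutdown();
    }
  }
}
```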

This debug log was upgraded to an info, yet it still has the debug check in front of it. 
I'm not sure this should be upgraded to an info log as part of this change.
    if (LOG.isDebugEnabled()) {
      LOG.info(String.format("Starting to download %s %s %s",

I don't see how the following code treats PATTERN files that don't end in '.jar' like ARCHIVE
files since the resource type check will still find PATTERN instead of ARCHIVE when it falls
through from the first {{if}} to the second:
      if (resource.getType() == LocalResourceType.PATTERN) {
        if (destinationFile.endsWith(".jar")) {
          // Unpack and keep a copy of the whole jar for mapreduce
          String p = resource.getPattern();
          RunJar.unJarAndSave(inputStream, new File(destination.toUri()),
              p == null ? RunJar.MATCH_ANY : Pattern.compile(p));
        } else {
          LOG.warn("Treating [" + source + "] " +
              "as an archive even though it " +
              "was specified as PATTERN");
      if (resource.getType() == LocalResourceType.ARCHIVE) {
        if (FileUtil.decompress(
            inputStream, source.getName(), destination, executor)) {
        } else {
          LOG.warn("Cannot unpack [" + source + "] trying to copy");

Is there a reason to implement the copy loop directly in the new unJar method rather than
calling copyBytes as the other one does?

> Use pipes when localizing archives
> ----------------------------------
>                 Key: YARN-2185
>                 URL: https://issues.apache.org/jira/browse/YARN-2185
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: nodemanager
>    Affects Versions: 2.4.0
>            Reporter: Jason Lowe
>            Assignee: Miklos Szegedi
>         Attachments: YARN-2185.000.patch, YARN-2185.001.patch, YARN-2185.002.patch
> Currently the nodemanager downloads an archive to a local file, unpacks it, and then
removes it.  It would be more efficient to stream the data as it's being unpacked to avoid
both the extra disk space requirements and the additional disk activity from storing the archive.
