hadoop-yarn-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Miklos Szegedi (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-2185) Use pipes when localizing archives
Date Wed, 10 Jan 2018 00:22:00 GMT

    [ https://issues.apache.org/jira/browse/YARN-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16319446#comment-16319446
] 

Miklos Szegedi commented on YARN-2185:
--------------------------------------

Thank you for the review, [~jlowe] and [~grepas].

bq. This patch adds parallel copying of directories and ability to ignore timestamps which
IMHO is outside the scope of this patch. I think those should be handled in separate JIRAs
with appropriate unit tests.
I created YARN-7713 and YARN-7712 for these two ideas respectively.
bq. Do we really want a while loop to pull off extensions? That's not how it behaved before,
and it will now do different, potentially unexpected, things for a ".jar.gz", ".tgz.tgz",
".gz.jar", etc. Granted these are unlikely filenames to encounter, but if somehow a user had
those before and were working, this will change that behavior.
I agree, let's be more conservative and do not change the existing behavior.
bq. Does it make sense to use an external process to unpack .jar and .zip files? The original
code did this via the JDK directly rather than requiring a subprocess, and given there's a
ZipInputStream and JarInputStream, it seems we could do the same here with the input stream
rather than requiring the overhead of a subprocess and copying of data between streams.
I was wondering, if it should be better for OS specific archives containing symlinks for example.
Here is what I suggest. I do not like the idea to use different method for tar and zip. On
the other hand we do not have enough information to know, how many users would actually work
better with OS toolset. I am not in favor of too many config knobs either but in this case
I created one. The administrator can decide whether to rely on OS tools altogether or java
itself. Calling out jar indeed does not make much sense, since it does the same what we would
do in the child process. I think this solution is much cleaner than arbitrarily use tar in
Linux but rely on Java for jars.
bq. This now logs an INFO message for all stderr and stdout for the subprocess
I addressed this concern.
bq. Wouldn't "cd destpath && tar -xf" be better than "cd destpath; tar -xf"? Otherwise
if somehow the change directory fails it will end up unpacking to an unexpected place.
I just followed the existing pattern. I agree, let’s be more resilient and I will change
to && as you suggested.
bq. Is the "rm -rf" necessary?
It does not hurt. I replaced with a delete that just returns, if the directory does not exist
bq. Has this been tested on Windows? The zipOrJar path appears to be eligible for Windows
but uses shell syntax that cmd is not going to understand, like semicolons between commands.
Good catch, the latest code uses the Java path now for Windows. However, I have limited ability
to test there.
bq. I wonder if it would be better to add a method to Shell that takes an input stream so
we could reuse all the subprocess handling code there.
My usage of ProcessBuilder seems very simple I am hesitant to overload the shell executor
with this functionality. This helps with backports as well.

I think I addressed most other comments in the next patch that I will submit soon.

> Use pipes when localizing archives
> ----------------------------------
>
>                 Key: YARN-2185
>                 URL: https://issues.apache.org/jira/browse/YARN-2185
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: nodemanager
>    Affects Versions: 2.4.0
>            Reporter: Jason Lowe
>            Assignee: Miklos Szegedi
>         Attachments: YARN-2185.000.patch, YARN-2185.001.patch, YARN-2185.002.patch
>
>
> Currently the nodemanager downloads an archive to a local file, unpacks it, and then
removes it.  It would be more efficient to stream the data as it's being unpacked to avoid
both the extra disk space requirements and the additional disk activity from storing the archive.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org


Mime
View raw message