hadoop-common-issues mailing list archives

From "Chris Nauroth (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-11509) change parsing sequence in GenericOptionsParser to parse -D parameters first
Date Mon, 26 Jan 2015 20:25:36 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-11509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292357#comment-14292357 ]

Chris Nauroth commented on HADOOP-11509:

Thank you, Xuan and Jian.

Just to provide a bit more background on this, Xuan found that streaming jobs using files
in Azure Storage were not able to override the setting of {{fs.azure.block.size}} from the
command line.  The root cause he found is that {{validateFiles}} checks for
existence of files against a {{FileSystem}} instance, but this {{FileSystem}} instance is
obtained before the -D options are handled.  This leaves an instance sitting in
the {{FileSystem}} cache that was created without the -D options set in the {{Configuration}}.
Later, during MapReduce job split calculation, the job uses that cached instance, which doesn't
have the override of {{fs.azure.block.size}}.

I agree with the change here, because the expectation is that the command line arguments take
precedence.  However, I don't think we should move the -D handling all the way to the top
of the method.  Right now, -D options are handled after -fs and -jt, so -D takes precedence
over them.  The current patch would reverse that.  I don't know if anyone depends on that
behavior, but we can avoid changing it by doing the -D handling in between the handling of
-conf and the handling of -libjars.  I'd be +1 for the patch with that change if you test
it and it still works for overriding {{fs.azure.block.size}}.
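
To illustrate why the position of the -D handling matters, here is a minimal sketch of the precedence effect.  It is a toy model, not the {{GenericOptionsParser}} code: {{OptionOrderSketch}} and {{applyInOrder}} are hypothetical names, the host names are made up, and it assumes that -fs ends up setting {{fs.defaultFS}} in the {{Configuration}}.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of option precedence; this is not Hadoop's GenericOptionsParser. */
final class OptionOrderSketch {

    /**
     * Applies each group of settings in the order given. A later group
     * overwrites earlier values for the same key, so whichever option is
     * handled last takes precedence.
     */
    @SafeVarargs
    static Map<String, String> applyInOrder(Map<String, String>... steps) {
        Map<String, String> conf = new LinkedHashMap<>();
        for (Map<String, String> step : steps) {
            conf.putAll(step);
        }
        return conf;
    }

    public static void main(String[] args) {
        // Assumed effect of "-fs hdfs://nn1:8020" (hypothetical host).
        Map<String, String> fsOption =
            Collections.singletonMap("fs.defaultFS", "hdfs://nn1:8020");
        // Effect of "-D fs.defaultFS=hdfs://nn2:8020" (hypothetical host).
        Map<String, String> dOption =
            Collections.singletonMap("fs.defaultFS", "hdfs://nn2:8020");

        // -D handled after -fs: the -D value wins.
        System.out.println(applyInOrder(fsOption, dOption).get("fs.defaultFS"));
        // -D handled before -fs: the -fs value now wins.
        System.out.println(applyInOrder(dOption, fsOption).get("fs.defaultFS"));
    }
}
```

Keeping the -D handling after -fs and -jt corresponds to the first call, where the -D value is the one that survives.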

bq. Should the API Path.getFileSystem(Configuration conf) be that the returned file system
object always apply the up-to-date conf ?

This is a long-standing weakness of the {{FileSystem}} cache.  It has been discussed in other
jiras, but I can't find those now.  The {{FileSystem}} cache key is composed of scheme, authority,
and {{UserGroupInformation}}.  However, the {{FileSystem#get}} API is phrased in terms of
a whole {{Configuration}}.  Various other configuration properties can tune the behavior of
a {{FileSystem}}, but if you get a cached instance, then these configuration properties might
not be applied.  OTOH, it would be too costly to make the whole {{Configuration}} part of
the cache key.
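
A rough sketch of the effect, using a toy cache rather than the real {{FileSystem}} code.  {{FsCacheSketch}}, {{Key}} and {{ToyFs}} are hypothetical names, the user name and authority are made up, the "default" and "override" values are placeholders, and it assumes the file system reads its settings once, when the instance is created.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Toy cache keyed by scheme, authority and user; this is not Hadoop's FileSystem. */
final class FsCacheSketch {

    /** Cache key. No configuration property takes part in it. */
    static final class Key {
        final String scheme;
        final String authority;
        final String user;

        Key(String scheme, String authority, String user) {
            this.scheme = scheme;
            this.authority = authority;
            this.user = user;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Key)) {
                return false;
            }
            Key k = (Key) o;
            return scheme.equals(k.scheme)
                && authority.equals(k.authority)
                && user.equals(k.user);
        }

        @Override
        public int hashCode() {
            return Objects.hash(scheme, authority, user);
        }
    }

    /** Stand-in for a file system that captures a setting at creation time. */
    static final class ToyFs {
        final String blockSize;

        ToyFs(Map<String, String> conf) {
            this.blockSize = conf.get("fs.azure.block.size");
        }
    }

    private final Map<Key, ToyFs> cache = new HashMap<>();

    /** Returns the cached instance for the key, creating it only on a miss. */
    ToyFs get(String scheme, String authority, String user, Map<String, String> conf) {
        return cache.computeIfAbsent(new Key(scheme, authority, user), k -> new ToyFs(conf));
    }

    public static void main(String[] args) {
        FsCacheSketch fsCache = new FsCacheSketch();
        Map<String, String> conf = new HashMap<>();
        conf.put("fs.azure.block.size", "default");

        // First lookup happens before the -D override reaches the configuration.
        ToyFs early = fsCache.get("wasb", "container@account", "alice", conf);

        // The override arrives too late for the instance created above.
        conf.put("fs.azure.block.size", "override");
        ToyFs later = fsCache.get("wasb", "container@account", "alice", conf);

        System.out.println(later == early);   // true: same cached instance
        System.out.println(later.blockSize);  // default
    }
}
```

In this model a lookup for a different user misses the cache and picks up the override, which shows the stale value comes from the cache hit and not from the configuration itself.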

This is an existing problem, unrelated to the current patch.

> change parsing sequence in GenericOptionsParser to parse -D parameters first
> ----------------------------------------------------------------------------
>                 Key: HADOOP-11509
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11509
>             Project: Hadoop Common
>          Issue Type: Bug
>            Reporter: Xuan Gong
>            Assignee: Xuan Gong
>         Attachments: HADOOP-11509.1.patch
> In GenericOptionsParser, we need to parse -D parameter first. In that case, the user
> input parameter (through -D) can be set into configuration object earlier and used to process
> other parameters.
