hadoop-submarine-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Szilard Nemeth (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SUBMARINE-47) Provide an implementation to parse configuration values from a YAML file
Date Tue, 02 Apr 2019 14:28:00 GMT

    [ https://issues.apache.org/jira/browse/SUBMARINE-47?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16807805#comment-16807805
] 

Szilard Nemeth commented on SUBMARINE-47:
-----------------------------------------

Thanks [~sunilg] for your comments!

 

1. Fixed the typo
2. As agreed with Gergo / Zoltan on this comment (https://issues.apache.org/jira/browse/SUBMARINE-47?focusedCommentId=16802829&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16802829),
if any CLI argument is specified over the YAML config file then the CLI will have precedence
over YAML.
Precedence is handled in: 
- org.apache.hadoop.yarn.submarine.client.cli.param.ParametersHolder#getOptionValue
- org.apache.hadoop.yarn.submarine.client.cli.param.ParametersHolder#getOptionValues

3. Refactored the messages to be clear with the type of the error: 
A.) Not existing file is handled as a distinct type of exception with proper message
B.) empty file is handled as a distinct type of exception with proper message
C.) If any other exception is caught, the details will be printed, for example:

 
{code:java}
Exception while parsing YAML file /var/folders/yj/qt9jnx2s6tvfmdm7kb8zwf2r0000gp/T/test2080393779705028964.yaml
while parsing a block mapping
 in 'reader', line 17, column 1:
 spec:
 ^
expected <block end>, but found BlockMappingStart
 in 'reader', line 23, column 3:
 checkpoint_path: testCheckpointPath
 ^

{code}
 

*The topic you brought up about handling the exceptions and printing only a short error message
is a good idea, but I think it's not in the scope of this patch, so I suggest a follow-up
jira.*
This is especially true as {{org.apache.hadoop.yarn.submarine.client.cli.RunJobCli#run}} throws
CLI parse exceptions ({{org.apache.commons.cli.ParseException}}) and {{org.apache.hadoop.yarn.submarine.client.cli.Cli#main}}
does not catch exceptions at all! This was the case before my patch as well!
What do you think?

4. That's a good point, removed the last argument.
5. Yes, if some value is missing from any section (e.g. security/principal) then its value
will be null in the final map!
What did you have in mind, what other values should I put into the map for the missing values?

Added some testcases to verify these cases explicitly here: {{org.apache.hadoop.yarn.submarine.client.cli.TestRunJobCliParsingYaml}}
Thanks for pointing out a possible NPE, a null-check was indeed missing from convertToEnvsList,
this is also covered with a testcase.

6. Added javadoc to hasOption, but I haven't found any more public methods added with this
patch that is undocumented.
7. Since the code in ParametersHolder runs only at startup and the debug messages would be
printed some tens of times (which is not that many), I wouldn't bother adding if conditions
before every LOG.debug() call as it won't improve the performance significantly

Please let me know if these makes sense!
Thanks!

> Provide an implementation to parse configuration values from a YAML file
> ------------------------------------------------------------------------
>
>                 Key: SUBMARINE-47
>                 URL: https://issues.apache.org/jira/browse/SUBMARINE-47
>             Project: Hadoop Submarine
>          Issue Type: New Feature
>            Reporter: Szilard Nemeth
>            Assignee: Szilard Nemeth
>            Priority: Major
>         Attachments: SUBMARINE-47.001.patch, SUBMARINE-47.002.patch, SUBMARINE-47.002.patch,
SUBMARINE-47.003.patch, SUBMARINE-47.003.patch, SUBMARINE-47.004.patch, SUBMARINE-47.005.patch,
valid-config.yaml
>
>
> In the Submarine design doc ([https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#heading=h.klmv5hrbgx9l]),
under section "Job commands" (submarine job run), the document talks about the need to implement
a YAML parser as an alternative configuration source of Submarine.
>  This jira provides an implementation of this YAML configuration parser.
> *Some details on the implementation:*
> *Data model and YAML library:* 
>  SnakeYAML was the library of choice, as we already include it in other Hadoop maven
modules.
>  SnakeYAML requires data classes to parse the YAML structure into Java classes, this
is very similar to parsing JSON data.
>  The data classes are under this package: org.apache.hadoop.yarn.submarine.client.cli.param.yaml
>  The data model is the following: 
>  1. On the root level, we have the class YamlConfigFile. This class has fields for configs,
roles, scheduling, etc.
>  2. Class Configs is a class that stores various config values.
>  3. We have the class Roles, that stores the role configurations, currently we only have
Worker and Ps (ParameterServer).
>  4. There's a base class for Roles, called Role. For some reasons, SnakeYAML did not
like this class as an abstract class, but it's not that important to have it like that, so
I stopped caring.
> There are several other data classes, please note that all of these classes are trying
to "mirror" the example YAML file from the design doc, so you can check that for more details.
> *Testing:* 
>  Added a new class called TestRunJobCliParsingYaml that exclusively verifies the correctness
of YAML parsing.
>  I tried to define the names of the testcases as descriptive and easy to understand as
possible.
>  Covering some edge cases was also a quite important factor, like checking how parsing
handles non-existing files, empty files, etc.
> *ParameterHolder:*
> In order to minimize changes in RunJobParameters, I introduced a ParameterHolder class
that is passed RunJobParameters#updateParametersByParsedCommandline instead of the parsed
CLI object we had before. This way, I could use the ParameterHolder to store the parsed CLI
values along with parsed YAML values, so it's a single source of configuration. This class
can act as a "decision point" about future needs on value precedences: For example, if we
want to have precedence for some configs coming from the CLI over the same configs coming
from the YAML file (or vica-versa), it is easily doable with the current implementation.
>  Currently, all values coming from the YAML config are in precedence.
> *What CLI options are covered by YAML:*
> Every CLI option have it's YAML counterpart, so the coverage is 100%.
>  See this method for all the available options: RunJobCli#generateOptions.
>  However, the design doc contains 2 fields (called reservation and placement) under the
'scheduling' section that are not added to the YAML data classes. The reason for this is that
I haven't found any implementation for these properties with the original CLI parser, either.
> *Questions:* 
>  1. Do we want to allow mixing of CLI configs and YAML configs? We could either fail-fast
if a config is defined twice. With my current implementation, the values coming from YAML
always have precedence over CLI values, so defining a config twice is not an error case yet.
>  2. What types of roles should we allow? Is there any additional apart from Worker and
Ps?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Mime
View raw message