hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "UnixShellScriptProgrammingGuide" by SomeOtherAccount
Date Tue, 31 May 2016 17:04:25 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "UnixShellScriptProgrammingGuide" page has been changed by SomeOtherAccount:
https://wiki.apache.org/hadoop/UnixShellScriptProgrammingGuide?action=diff&rev1=20&rev2=21

Comment:
More dynamic subcommands updates

  ## page was renamed from ShellScriptProgrammingGuide
  = Introduction =
  With [[https://issues.apache.org/jira/browse/HADOOP-9902|HADOOP-9902]], the shell script
code base has been refactored, with common functions and utilities put into a shell library
(hadoop-functions.sh).  Here are some tips and tricks to get the most out of using this functionality:
  
  = The Skeleton =
  All properly built shell scripts contain the following sections (a minimal skeleton sketch follows the list):
  
   1. `hadoop_usage` function that contains an alphabetized list of subcommands and their descriptions.  This is used when the user explicitly asks for help, when there is a command line syntax error, etc.
  
   1. `HADOOP_LIBEXEC_DIR` configured.  This should be set to the directory where `hadoop-functions.sh`, `hadoop-config.sh`, etc., are located.
  
   1. `HADOOP_NEW_CONFIG=true`.  This tells the rest of the system that the code being executed is aware that it is using the new shell API and will call the routines it needs on its own.  If this isn't set, several default actions that were done in Hadoop 2.x and earlier are executed, and several key parts of the functionality are lost.
  
   1. `$HADOOP_LIBEXEC_DIR/abc-config.sh` is executed, where abc is the subproject.  HDFS scripts should call `hdfs-config.sh`, MAPRED scripts should call `mapred-config.sh`, and YARN scripts should call `yarn-config.sh`.  Everything else should call `hadoop-config.sh`.  This does a lot of standard initialization, processes standard options, etc.  It is also what provides override capabilities for subproject-specific environment variables.  For example, the system will normally ignore `yarn-env.sh`, but `yarn-config.sh` will activate those settings.
  
   1. This is where the majority of your code goes.  Programs should process the rest of the arguments and do whatever the script is supposed to do.
  
   1. Before executing a Java program (preferably via `hadoop_java_exec`) or giving user output, call `hadoop_finalize`.  This finishes up the configuration details: adds the user classpath, fixes up any missing Java properties, configures library paths, etc.
  
   1. Either an `exit` or an `exec`.  This should return 0 for success and 1 or higher for failure.
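
  A minimal sketch of such a skeleton is below.  It is illustrative only: the `abc` front-end name, the `frobnicate` subcommand, and the `org.example.Frobnicator` class are made-up placeholders, and the exact library calls should be checked against the real scripts shipped with Hadoop.

{{{
#!/usr/bin/env bash
# skeleton sketch for a hypothetical "abc" command

## 1. alphabetized usage output
hadoop_usage()
{
  echo "Usage: abc [--config confdir] <subcommand>"
  echo "    frobnicate     run the (hypothetical) frobnicator"
}

## 2. where hadoop-functions.sh, hadoop-config.sh, etc., live
bin=$(cd -P -- "$(dirname -- "${BASH_SOURCE-0}")" >/dev/null && pwd -P)
HADOOP_LIBEXEC_DIR="${HADOOP_LIBEXEC_DIR:-${bin}/../libexec}"

## 3. declare that this script uses the new shell API
HADOOP_NEW_CONFIG=true

## 4. standard initialization (hdfs/mapred/yarn scripts call their own *-config.sh)
. "${HADOOP_LIBEXEC_DIR}/hadoop-config.sh"

## 5. process the remaining arguments
subcmd=$1
shift
case ${subcmd} in
  frobnicate)
    HADOOP_CLASSNAME=org.example.Frobnicator   # placeholder class
  ;;
  *)
    hadoop_exit_with_usage 1
  ;;
esac

## 6. finish up classpath, Java properties, library paths, ...
hadoop_finalize

## 7. launch the JVM; hadoop_java_exec ends in an exec
hadoop_java_exec "${subcmd}" "${HADOOP_CLASSNAME}" "$@"
}}}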
  
  = Adding a Subcommand to an Existing Script (NOT hadoop-tools-based) =
  In order to add a new subcommand, there are a few things that need to be done (a sketch of the resulting case entry follows the list):
  
   1. Add a line to that script's `hadoop_usage` function that lists the name of the subcommand
and what it does.  This should be alphabetized.
  
   1. Add an additional entry in the case conditional. Depending upon what is being added, several things may need to be done:
    a. Set `HADOOP_CLASSNAME` to the Java class that implements the subcommand.
    b. Add $HADOOP_CLIENT_OPTS to $HADOOP_OPTS (or, for YARN apps, $YARN_CLIENT_OPTS to $YARN_OPTS) if this is an interactive application or it should otherwise have the user client settings applied.
    c. For subcommands that can also be daemons, set `HADOOP_SUBCMD_SUPPORTDAEMONIZATION=true`.  This allows the `--daemon` option to work. See more below.
    d. If it supports security, set `HADOOP_SUBCMD_SECURESERVICE=true` and `HADOOP_SUBCMD_SECUREUSER` to the user that should run the daemon.
  
   1. If a new subcommand needs one or more extra environment variables:
    a. Add documentation and a '''commented-out''' example that shows the default setting.
    b. Add the default(s) to that subproject's hadoop_subproject_init (or hadoop_basic_init for common), using the current shell vars as a guide.  (Specifically, it should allow overriding!)
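
  As a rough sketch (not copied from any shipping script), a new case entry might look like the following; `frobnicate`, the class name, and the `frob` user are placeholders, and a real subcommand normally needs only the pieces that apply to it.

{{{
case ${subcmd} in
  frobnicate)
    # a. the Java class that implements the subcommand
    HADOOP_CLASSNAME=org.example.Frobnicator
    # b. interactive/client use: apply the user's client settings
    HADOOP_OPTS="${HADOOP_OPTS} ${HADOOP_CLIENT_OPTS}"
    # c. let --daemon start/status/stop work for this subcommand
    HADOOP_SUBCMD_SUPPORTDAEMONIZATION=true
    # d. secure daemon support: who should run it
    HADOOP_SUBCMD_SECURESERVICE=true
    HADOOP_SUBCMD_SECUREUSER=frob
  ;;
  *)
    hadoop_exit_with_usage 1
  ;;
esac
}}}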
  
  = Adding a Subcommand to an Existing Script (hadoop-tools-based) =
  As of HADOOP-12930, subcommands that come from hadoop-tools use the Dynamic Subcommands functionality.  This allows end users to replace or override these utilities with their own versions, and it keeps the classpath from exploding with extra dependencies.
  
   1. Create a src/main/shellprofile.d directory.
   1. Inside it, create a hadoop-name.sh file that contains the bash functions necessary to create a Dynamic Subcommand.  Note that versions that ship with Hadoop need to verify that the function doesn't already exist (see, for example, hadoop-archives/src/main/shellprofile.d).  A sketch of such a file follows the pom.xml example below.
   1. Modify the hadoop-tools assembly to copy this shellprofile.d into the correct place.
   1. To get hadoop_add_to_classpath_tools functionality to work for your command, add the following to your pom.xml.  This uses the Maven dependency plug-in to generate a dependency list that the build system will use to create the file needed by that function.
  
{{{
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-dependency-plugin</artifactId>
        <executions>
          <execution>
            <id>deplist</id>
            <phase>compile</phase>
            <goals>
              <goal>list</goal>
            </goals>
            <configuration>
              <!-- referenced by a built-in command -->
              <outputFile>${project.basedir}/target/hadoop-tools-deps/${project.artifactId}.tools-builtin.txt</outputFile>
            </configuration>
          </execution>
        </executions>
      </plugin>
}}}
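
  To make the second step above concrete, a shellprofile.d file might look roughly like the sketch below.  The `frobnicate` subcommand, the `hadoop-frobnicator` module name, and the class are placeholders; check the real files under hadoop-archives/src/main/shellprofile.d for the exact helper functions and conventions to use.

{{{
# src/main/shellprofile.d/hadoop-frobnicator.sh  (hypothetical tool)

# Versions that ship with Hadoop must not clobber a user-supplied override,
# so only define the function if it does not already exist.
if ! declare -f hadoop_subcommand_frobnicate >/dev/null 2>&1; then

  # advertise the subcommand in the front-end's usage output
  hadoop_add_subcommand "frobnicate" "run the (hypothetical) frobnicator"

  function hadoop_subcommand_frobnicate
  {
    # class that implements the tool (placeholder)
    HADOOP_CLASSNAME=org.example.Frobnicator
    # pull in this tool's dependencies via the deplist generated above
    hadoop_add_to_classpath_tools hadoop-frobnicator
  }

fi
}}}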
  
  
  = Better Practices =
   * Avoid adding more globals or project-specific globals, entries in *-env.sh, and/or comments at the bottom of this page.  In a lot of cases, there is pre-existing functionality that already does what you might need to do.  Additionally, every configuration option makes things that much harder for end users.  If you do need to add a new global variable for additional functionality, start it with HADOOP_ for common, HDFS_ for HDFS, YARN_ for YARN, and MAPRED_ for MapReduce.  It should be documented in either *-env.sh (for user-overridable parts) or hadoop-functions.sh (for internal-only globals).  This helps prevent our variables from clobbering other people's.
  
   * Remember that abc_xyz_OPTS can and should act as a catch-all for Java daemon options.  Custom heap environment variables and other custom daemon variables add unnecessary complexity for both the user and us.  They should be avoided.  In almost every case, it is better to have a global and apply it to all daemons as a universal default.  Users can/will override those variables as necessary in their init scripts.  This also helps cover the case when functionality starts in one chunk of Hadoop but ends up in multiple places.

   * A decent shell lint is available at http://www.shellcheck.net .  Mac users can `brew install shellcheck` to install it locally.  Like lint, however, be aware that it will sometimes flag things that are legitimate.  These can be marked with a 'shellcheck disable' comment, as sketched below.  (Usually, our biggest offense that shellcheck flags is using $HADOOP_OPTS without quotes.  Our usage without quotes is correct for the current code base.  It is, however, a bad practice, and shellcheck is right to tell us about it.)
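
  For example (a sketch, not a line from the current code base), a deliberately unquoted expansion can be annotated so that shellcheck stops flagging it:

{{{
# options are deliberately unquoted so they split into separate words
# shellcheck disable=SC2086
exec "${JAVA}" ${HADOOP_OPTS} "${HADOOP_CLASSNAME}" "$@"
}}}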
  
  = Standard Environment Variables =
  In addition to all of the variables documented in `*-env.sh` and `hadoop-layout.sh`, there
are a handful of special env vars:
  
  * `HADOOP_HEAP_MAX` - This is the Xmx parameter to be passed to Java (e.g., `"-Xmx1g"`).  This is present for backward compatibility; however, it should be added to `HADOOP_OPTS` via `hadoop_add_param HADOOP_OPTS Xmx "${JAVA_HEAP_MAX}"` prior to calling `hadoop_finalize`.

  
  * `HADOOP_SLAVE_NAMES` - This is the list of hosts the user passed via `--hostnames`.
  
  = About Daemonization =
  In branch-2 and previous, "daemons" were handled by wrapping "standard" command lines. If we concentrate on the functionality (vs. the code rot...), this had some interesting (and inconsistent) results, especially around logging and pid files. If you ran the `*-daemon` version, you got a pid file and `hadoop.root.logger` was set to `INFO,(something)`. When a daemon was run in non-daemon mode (e.g., straight up: `hdfs namenode`), no pid file was generated and `hadoop.root.logger` was kept as `INFO,console`. With no pid file generated, it was possible to run, e.g., hdfs namenode both in *-daemon.sh mode and straight up at the same time. It also meant that one needed to pull apart the process list to safely determine the status of the daemon, since pid files weren't always created. This made building custom init scripts fraught with danger. This inconsistency has been a point of frustration for many operations teams.
  
  Post-HADOOP-9902, there is a slight change in the above functionality, and it is one of the key reasons why this is an incompatible change. Sub-commands that were intended to run as daemons (either fully, e.g., namenode, or partially, e.g., balancer) have all of this handling consolidated, helping to eliminate code rot as well as providing a consistent user experience across projects. Setting `HADOOP_SUBCMD_SUPPORTDAEMONIZATION=true` (a per-script local that is nevertheless consistent across the Hadoop sub-projects) tells the latter parts of the shell code that this sub-command needs some extra handling enabled beyond the normal commands. In particular, such sub-commands will always get pid and out files. They will prevent two instances being run on the same machine by the same user simultaneously (see footnote 1, however). They get some extra options on the java command line. And so on.
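
  For example, with the post-HADOOP-9902 scripts a daemon-capable subcommand can be run either way from the same front end (intended usage, shown as a sketch):

{{{
# foreground: logs go to the console, no pid file
bin/hdfs namenode

# daemonized: pid and .out files are created, logging goes to files
bin/hdfs --daemon start namenode
bin/hdfs --daemon status namenode
bin/hdfs --daemon stop namenode
}}}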

  1 - ... unless `HADOOP_IDENT_STRING` is modified appropriately. This means that post-HADOOP-9902 it is now possible to run two secure datanodes on the same machine as the same user, since all of the logs, pids, and outs take that into consideration! QA folks should be very happy.
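
  As a sketch of that idea (a real second instance would also need its own configuration, ports, and data directories):

{{{
# first instance uses the default HADOOP_IDENT_STRING (normally the user name)
bin/hdfs --daemon start datanode

# second instance gets its own identifier, so pid/log/out names don't collide
HADOOP_IDENT_STRING="${USER}-dn2" bin/hdfs --daemon start datanode
}}}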
  
  = A New Subproject or Subproject-like Structure =
  The following files should be the basis of the new bits:
  
  * libexec/(project)-config.sh


