hadoop-common-commits mailing list archives

From whe...@apache.org
Subject [07/12] hadoop git commit: HADOOP-11633. Convert remaining branch-2 .apt.vm files to markdown. Contributed by Masatake Iwasaki.
Date Wed, 11 Mar 2015 21:31:32 GMT
http://git-wip-us.apache.org/repos/asf/hadoop/blob/e75e6c66/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/apt/MapredCommands.apt.vm
----------------------------------------------------------------------
diff --git a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/apt/MapredCommands.apt.vm b/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/apt/MapredCommands.apt.vm
deleted file mode 100644
index 17fad0c..0000000
--- a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/apt/MapredCommands.apt.vm
+++ /dev/null
@@ -1,227 +0,0 @@
-~~ Licensed to the Apache Software Foundation (ASF) under one or more
-~~ contributor license agreements.  See the NOTICE file distributed with
-~~ this work for additional information regarding copyright ownership.
-~~ The ASF licenses this file to You under the Apache License, Version 2.0
-~~ (the "License"); you may not use this file except in compliance with
-~~ the License.  You may obtain a copy of the License at
-~~
-~~     http://www.apache.org/licenses/LICENSE-2.0
-~~
-~~ Unless required by applicable law or agreed to in writing, software
-~~ distributed under the License is distributed on an "AS IS" BASIS,
-~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-~~ See the License for the specific language governing permissions and
-~~ limitations under the License.
-
-  ---
-  MapReduce Commands Guide
-  ---
-  ---
-  ${maven.build.timestamp}
-
-MapReduce Commands Guide
-
-%{toc|section=1|fromDepth=2|toDepth=4}
-
-* Overview
-
-  MapReduce commands are invoked by the <<<bin/mapred>>> script. Running the
-  script without any arguments prints the description for all commands.
-
-   Usage: <<<mapred [--config confdir] [--loglevel loglevel] COMMAND>>>
-
-   MapReduce has an option-parsing framework that handles generic options
-   as well as running classes.
-
-*-------------------------+---------------------------------------------------+
-|| COMMAND_OPTIONS        || Description                                      |
-*-------------------------+---------------------------------------------------+
-| --config confdir | Overrides the default configuration directory. Default
-|                  | is $\{HADOOP_PREFIX\}/conf.
-*-------------------------+---------------------------------------------------+
-| --loglevel loglevel | Overrides the log level. Valid log levels are FATAL,
-|                     | ERROR, WARN, INFO, DEBUG, and TRACE. Default is INFO.
-*-------------------------+---------------------------------------------------+
-| COMMAND COMMAND_OPTIONS | Various commands with their options are described
-|                         | in the following sections. The commands have been
-|                         | grouped into {{User Commands}} and
-|                         | {{Administration Commands}}.
-*-------------------------+---------------------------------------------------+
-
-* User Commands
-
-   Commands useful for users of a hadoop cluster.
-
-** <<<pipes>>>
-
-   Runs a pipes job.
-
-   Usage: <<<mapred pipes [-conf <path>] [-jobconf <key=value>, <key=value>,
-   ...] [-input <path>] [-output <path>] [-jar <jar file>] [-inputformat
-   <class>] [-map <class>] [-partitioner <class>] [-reduce <class>] [-writer
-   <class>] [-program <executable>] [-reduces <num>]>>>
-
-*----------------------------------------+------------------------------------+
-|| COMMAND_OPTION                        || Description
-*----------------------------------------+------------------------------------+
-| -conf <path>                           | Configuration for job
-*----------------------------------------+------------------------------------+
-| -jobconf <key=value>, <key=value>, ... | Add/override configuration for job
-*----------------------------------------+------------------------------------+
-| -input <path>                          | Input directory
-*----------------------------------------+------------------------------------+
-| -output <path>                         | Output directory
-*----------------------------------------+------------------------------------+
-| -jar <jar file>                        | Jar filename
-*----------------------------------------+------------------------------------+
-| -inputformat <class>                   | InputFormat class
-*----------------------------------------+------------------------------------+
-| -map <class>                           | Java Map class
-*----------------------------------------+------------------------------------+
-| -partitioner <class>                   | Java Partitioner
-*----------------------------------------+------------------------------------+
-| -reduce <class>                        | Java Reduce class
-*----------------------------------------+------------------------------------+
-| -writer <class>                        | Java RecordWriter
-*----------------------------------------+------------------------------------+
-| -program <executable>                  | Executable URI
-*----------------------------------------+------------------------------------+
-| -reduces <num>                         | Number of reduces
-*----------------------------------------+------------------------------------+
-
-** <<<job>>>
-
-   Command to interact with MapReduce jobs.
-
-   Usage: <<<mapred job
-          | [{{{../../hadoop-project-dist/hadoop-common/CommandsManual.html#Generic_Options}GENERIC_OPTIONS}}]
-          | [-submit <job-file>]
-          | [-status <job-id>]
-          | [-counter <job-id> <group-name> <counter-name>]
-          | [-kill <job-id>]
-          | [-events <job-id> <from-event-#> <#-of-events>]
-          | [-history [all] <jobOutputDir>] | [-list [all]]
-          | [-kill-task <task-id>] | [-fail-task <task-id>]
-          | [-set-priority <job-id> <priority>]>>>
-
-*------------------------------+---------------------------------------------+
-|| COMMAND_OPTION              || Description
-*------------------------------+---------------------------------------------+
-| -submit <job-file>           | Submits the job.
-*------------------------------+---------------------------------------------+
-| -status <job-id>             | Prints the map and reduce completion
-                               | percentage and all job counters.
-*------------------------------+---------------------------------------------+
-| -counter <job-id> <group-name> <counter-name> | Prints the counter value.
-*------------------------------+---------------------------------------------+
-| -kill <job-id>               | Kills the job.
-*------------------------------+---------------------------------------------+
-| -events <job-id> <from-event-#> <#-of-events> | Prints the events' details
-                               | received by jobtracker for the given range.
-*------------------------------+---------------------------------------------+
-| -history [all] <jobOutputDir> | Prints job details, failed and killed tip
-                               | details.  More details about the job such as
-                               | successful tasks and task attempts made for
-                               | each task can be viewed by specifying the
-                               | [all] option.
-*------------------------------+---------------------------------------------+
-| -list [all]                  | Displays jobs which are yet to complete.
-                               | <<<-list all>>> displays all jobs.
-*------------------------------+---------------------------------------------+
-| -kill-task <task-id>         | Kills the task. Killed tasks are NOT counted
-                               | against failed attempts.
-*------------------------------+---------------------------------------------+
-| -fail-task <task-id>         | Fails the task. Failed tasks are counted
-                               | against failed attempts.
-*------------------------------+---------------------------------------------+
-| -set-priority <job-id> <priority> | Changes the priority of the job. Allowed
-                               | priority values are VERY_HIGH, HIGH, NORMAL,
-                               | LOW, VERY_LOW
-*------------------------------+---------------------------------------------+
-
-** <<<queue>>>
-
-   Command to interact with and view Job Queue information.
-
-   Usage: <<<mapred queue [-list] | [-info <job-queue-name> [-showJobs]]
-          | [-showacls]>>>
-
-*-----------------+-----------------------------------------------------------+
-|| COMMAND_OPTION || Description
-*-----------------+-----------------------------------------------------------+
-| -list           | Gets the list of Job Queues configured in the system,
-                  | along with the scheduling information associated with
-                  | the job queues.
-*-----------------+-----------------------------------------------------------+
-| -info <job-queue-name> [-showJobs] | Displays the job queue information and
-                  | associated scheduling information of the given job queue.
-                  | If the <<<-showJobs>>> option is present, a list of jobs
-                  | submitted to that job queue is displayed.
-*-----------------+-----------------------------------------------------------+
-| -showacls       | Displays the queue name and associated queue operations
-                  | allowed for the current user. The list consists of only
-                  | those queues to which the user has access.
-*-----------------+-----------------------------------------------------------+
-
-** <<<classpath>>>
-
-   Prints the class path needed to get the Hadoop jar and the required
-   libraries.
-
-   Usage: <<<mapred classpath>>>
-
-** <<<distcp>>>
-
-   Copies files or directories recursively. More information can be found at
-   {{{./DistCp.html}Hadoop DistCp Guide}}.
-
-** <<<archive>>>
-
-   Creates a hadoop archive. More information can be found at
-   {{{./HadoopArchives.html}Hadoop Archives Guide}}.
-
-* Administration Commands
-
-   Commands useful for administrators of a hadoop cluster.
-
-** <<<historyserver>>>
-
-   Starts the JobHistoryServer.
-
-   Usage: <<<mapred historyserver>>>
-
-** <<<hsadmin>>>
-
-   Runs a MapReduce hsadmin client to execute JobHistoryServer administrative
-   commands.
-
-   Usage: <<<mapred hsadmin
-          [-refreshUserToGroupsMappings] |
-          [-refreshSuperUserGroupsConfiguration] |
-          [-refreshAdminAcls] |
-          [-refreshLoadedJobCache] |
-          [-refreshLogRetentionSettings] |
-          [-refreshJobRetentionSettings] |
-          [-getGroups [username]] | [-help [cmd]]>>>
-
-*-----------------+-----------------------------------------------------------+
-|| COMMAND_OPTION || Description
-*-----------------+-----------------------------------------------------------+
-| -refreshUserToGroupsMappings | Refresh user-to-groups mappings
-*-----------------+-----------------------------------------------------------+
-| -refreshSuperUserGroupsConfiguration| Refresh superuser proxy groups mappings
-*-----------------+-----------------------------------------------------------+
-| -refreshAdminAcls | Refresh acls for administration of Job history server
-*-----------------+-----------------------------------------------------------+
-| -refreshLoadedJobCache | Refresh loaded job cache of Job history server
-*-----------------+-----------------------------------------------------------+
-| -refreshJobRetentionSettings|Refresh job history period, job cleaner settings
-*-----------------+-----------------------------------------------------------+
-| -refreshLogRetentionSettings | Refresh log retention period and log retention
-|                              | check interval
-*-----------------+-----------------------------------------------------------+
-| -getGroups [username] | Get the groups which the given user belongs to
-*-----------------+-----------------------------------------------------------+
-| -help [cmd] | Displays help for the given command or all commands if none is
-|             | specified.
-*-----------------+-----------------------------------------------------------+

http://git-wip-us.apache.org/repos/asf/hadoop/blob/e75e6c66/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/apt/PluggableShuffleAndPluggableSort.apt.vm
----------------------------------------------------------------------
diff --git a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/apt/PluggableShuffleAndPluggableSort.apt.vm b/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/apt/PluggableShuffleAndPluggableSort.apt.vm
deleted file mode 100644
index 06d8022..0000000
--- a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/apt/PluggableShuffleAndPluggableSort.apt.vm
+++ /dev/null
@@ -1,98 +0,0 @@
-~~ Licensed under the Apache License, Version 2.0 (the "License");
-~~ you may not use this file except in compliance with the License.
-~~ You may obtain a copy of the License at
-~~
-~~   http://www.apache.org/licenses/LICENSE-2.0
-~~
-~~ Unless required by applicable law or agreed to in writing, software
-~~ distributed under the License is distributed on an "AS IS" BASIS,
-~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-~~ See the License for the specific language governing permissions and
-~~ limitations under the License. See accompanying LICENSE file.
-
-  ---
-  Hadoop Map Reduce Next Generation-${project.version} - Pluggable Shuffle and Pluggable Sort
-  ---
-  ---
-  ${maven.build.timestamp}
-
-Hadoop MapReduce Next Generation - Pluggable Shuffle and Pluggable Sort
-
-* Introduction
-
-  The pluggable shuffle and pluggable sort capabilities allow replacing the
-  built-in shuffle and sort logic with alternate implementations. Example use
-  cases are: using an application protocol other than HTTP, such as RDMA, for
-  shuffling data from the Map nodes to the Reducer nodes; or replacing the
-  sort logic with custom algorithms that enable Hash aggregation and Limit-N
-  query.
-
-  <<IMPORTANT:>> The pluggable shuffle and pluggable sort capabilities are 
-  experimental and unstable. This means the provided APIs may change and break 
-  compatibility in future versions of Hadoop.
-
-* Implementing a Custom Shuffle and a Custom Sort 
-
-  A custom shuffle implementation requires a
-  <<<org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.AuxiliaryService>>>
-  implementation class running in the NodeManagers and a
-  <<<org.apache.hadoop.mapred.ShuffleConsumerPlugin>>> implementation class
-  running in the Reducer tasks.
-
-  The default implementations provided by Hadoop can be used as references:
-
-    * <<<org.apache.hadoop.mapred.ShuffleHandler>>>
-    
-    * <<<org.apache.hadoop.mapreduce.task.reduce.Shuffle>>>
-
-  A custom sort implementation requires a <<<org.apache.hadoop.mapred.MapOutputCollector>>>
-  implementation class running in the Mapper tasks and (optionally, depending
-  on the sort implementation) a <<<org.apache.hadoop.mapred.ShuffleConsumerPlugin>>>
-  implementation class running in the Reducer tasks.
-
-  The default implementations provided by Hadoop can be used as references:
-
-  * <<<org.apache.hadoop.mapred.MapTask$MapOutputBuffer>>>
-  
-  * <<<org.apache.hadoop.mapreduce.task.reduce.Shuffle>>>
-
-* Configuration
-
-  Except for the auxiliary service running in the NodeManagers serving the
-  shuffle (by default the <<<ShuffleHandler>>>), all the pluggable components
-  run in the job tasks. This means they can be configured on a per-job basis.
-  The auxiliary service servicing the Shuffle must be configured in the
-  NodeManagers configuration.
-
-** Job Configuration Properties (on a per-job basis):
-
-*--------------------------------------+---------------------+-----------------+
-| <<Property>>                         | <<Default Value>>   | <<Explanation>> |
-*--------------------------------------+---------------------+-----------------+
-| <<<mapreduce.job.reduce.shuffle.consumer.plugin.class>>> | <<<org.apache.hadoop.mapreduce.task.reduce.Shuffle>>> | The <<<ShuffleConsumerPlugin>>> implementation to use |
-*--------------------------------------+---------------------+-----------------+
-| <<<mapreduce.job.map.output.collector.class>>>   | <<<org.apache.hadoop.mapred.MapTask$MapOutputBuffer>>> | The <<<MapOutputCollector>>> implementation(s) to use |
-*--------------------------------------+---------------------+-----------------+
-
-  These properties can also be set in the <<<mapred-site.xml>>> to change the default values for all jobs.
-
-  The collector class configuration may specify a comma-separated list of collector implementations.
-  In this case, the map task will attempt to instantiate each in turn until one of the
-  implementations successfully initializes. This can be useful if a given collector
-  implementation is only compatible with certain types of keys or values, for example.
-
-** NodeManager Configuration properties, <<<yarn-site.xml>>> in all nodes:
-
-*--------------------------------------+---------------------+-----------------+
-| <<Property>>                         | <<Default Value>>   | <<Explanation>> |
-*--------------------------------------+---------------------+-----------------+
-| <<<yarn.nodemanager.aux-services>>> | <<<...,mapreduce_shuffle>>> | The auxiliary service name |
-*--------------------------------------+---------------------+-----------------+
-| <<<yarn.nodemanager.aux-services.mapreduce_shuffle.class>>>   | <<<org.apache.hadoop.mapred.ShuffleHandler>>> | The auxiliary service class to use |
-*--------------------------------------+---------------------+-----------------+
-
-  <<IMPORTANT:>> If setting an auxiliary service in addition to the default
-  <<<mapreduce_shuffle>>> service, then a new service key should be added to the
-  <<<yarn.nodemanager.aux-services>>> property, for example <<<mapreduce_shufflex>>>.
-  Then the property defining the corresponding class must be
-  <<<yarn.nodemanager.aux-services.mapreduce_shufflex.class>>>.

http://git-wip-us.apache.org/repos/asf/hadoop/blob/e75e6c66/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/DistributedCacheDeploy.md.vm
----------------------------------------------------------------------
diff --git a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/DistributedCacheDeploy.md.vm b/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/DistributedCacheDeploy.md.vm
new file mode 100644
index 0000000..36ad8fc
--- /dev/null
+++ b/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/DistributedCacheDeploy.md.vm
@@ -0,0 +1,119 @@
+<!---
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License. See accompanying LICENSE file.
+-->
+
+#set ( $H3 = '###' )
+#set ( $H4 = '####' )
+#set ( $H5 = '#####' )
+
+Hadoop: Distributed Cache Deploy
+================================
+
+Introduction
+------------
+
+The MapReduce application framework has rudimentary support for deploying a new version of the MapReduce framework via the distributed cache. By setting the appropriate configuration properties, users can run a different version of MapReduce than the one initially deployed to the cluster. For example, cluster administrators can place multiple versions of MapReduce in HDFS and configure `mapred-site.xml` to specify which version jobs will use by default. This allows the administrators to perform a rolling upgrade of the MapReduce framework under certain conditions.
+
+Preconditions and Limitations
+-----------------------------
+
+The support for deploying the MapReduce framework via the distributed cache currently does not address the job client code used to submit and query jobs. It also does not address the `ShuffleHandler` code that runs as an auxiliary service within each NodeManager. As a result, the following limitations apply to MapReduce versions that can be successfully deployed via the distributed cache in a rolling upgrade fashion:
+
+* The MapReduce version must be compatible with the job client code used to
+  submit and query jobs. If it is incompatible then the job client must be
+  upgraded separately on any node from which jobs using the new MapReduce
+  version will be submitted or queried.
+
+* The MapReduce version must be compatible with the configuration files used
+  by the job client submitting the jobs. If it is incompatible with that
+  configuration (e.g.: a new property must be set or an existing property
+  value changed) then the configuration must be updated first.
+
+* The MapReduce version must be compatible with the `ShuffleHandler`
+  version running on the nodes in the cluster. If it is incompatible then the
+  new `ShuffleHandler` code must be deployed to all the nodes in the
+  cluster, and the NodeManagers must be restarted to pick up the new
+  `ShuffleHandler` code.
+
+Deploying a New MapReduce Version via the Distributed Cache
+-----------------------------------------------------------
+
+Deploying a new MapReduce version consists of three steps:
+
+1.  Upload the MapReduce archive to a location that can be accessed by the
+    job submission client. Ideally the archive should be on the cluster's default
+    filesystem at a publicly-readable path. See the archive location discussion
+    below for more details.
+
+2.  Configure `mapreduce.application.framework.path` to point to the
+    location where the archive is located. As when specifying distributed cache
+    files for a job, this is a URL that also supports creating an alias for the
+    archive if a URL fragment is specified. For example,
+    `hdfs:/mapred/framework/hadoop-mapreduce-${project.version}.tar.gz#mrframework`
+    will be localized as `mrframework` rather than
+    `hadoop-mapreduce-${project.version}.tar.gz`.
+
+3.  Configure `mapreduce.application.classpath` to set the proper
+    classpath to use with the MapReduce archive configured above. NOTE: An error
+    occurs if `mapreduce.application.framework.path` is configured but
+    `mapreduce.application.classpath` does not reference the base name of the
+    archive path or the alias if an alias was specified.
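+
+As a minimal sketch of steps 2 and 3, the two properties might be set in `mapred-site.xml` as shown below. The archive URL and the `mrframework` alias are taken from the example in step 2. The classpath entries are an assumption: they presume the archive contains the MapReduce, YARN, HDFS and Hadoop Common jars and is laid out like the standard Hadoop distribution, so they should be adjusted to match the actual archive.
+
+```xml
+  <!-- Illustrative values only: the path, alias and archive layout are assumptions. -->
+  <property>
+    <name>mapreduce.application.framework.path</name>
+    <value>hdfs:/mapred/framework/hadoop-mapreduce-${project.version}.tar.gz#mrframework</value>
+  </property>
+
+  <!-- The classpath refers to the alias (mrframework), not the archive basename. -->
+  <property>
+    <name>mapreduce.application.classpath</name>
+    <value>$HADOOP_CONF_DIR,$PWD/mrframework/hadoop-mapreduce-${project.version}/share/hadoop/mapreduce/*,$PWD/mrframework/hadoop-mapreduce-${project.version}/share/hadoop/mapreduce/lib/*,$PWD/mrframework/hadoop-mapreduce-${project.version}/share/hadoop/common/*,$PWD/mrframework/hadoop-mapreduce-${project.version}/share/hadoop/common/lib/*,$PWD/mrframework/hadoop-mapreduce-${project.version}/share/hadoop/yarn/*,$PWD/mrframework/hadoop-mapreduce-${project.version}/share/hadoop/yarn/lib/*,$PWD/mrframework/hadoop-mapreduce-${project.version}/share/hadoop/hdfs/*,$PWD/mrframework/hadoop-mapreduce-${project.version}/share/hadoop/hdfs/lib/*</value>
+  </property>
+```
+
+Because an alias is specified in the URL fragment, every archive-relative classpath entry starts with `$PWD/mrframework`. Without the fragment, the entries would start with the archive basename instead.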
+
+$H3 Location of the MapReduce Archive and How It Affects Job Performance
+
+Note that the location of the MapReduce archive can be critical to job submission and job startup performance. If the archive is not located on the cluster's default filesystem then it will be copied to the job staging directory for each job and localized to each node where the job's tasks run. This will slow down job submission and task startup.
+
+If the archive is located on the default filesystem then the job client will not upload the archive to the job staging directory for each job submission. However, if the archive path is not readable by all cluster users then the archive will be localized separately for each user on each node where tasks execute. This can cause unnecessary duplication in the distributed cache.
+
+When working with a large cluster it can be important to increase the replication factor of the archive to increase its availability. This will spread the load when the nodes in the cluster localize the archive for the first time.
+
+MapReduce Archives and Classpath Configuration
+----------------------------------------------
+
+Setting a proper classpath for the MapReduce archive depends upon the composition of the archive and whether it has any additional dependencies. For example, the archive can contain not only the MapReduce jars but also the necessary YARN, HDFS, and Hadoop Common jars and all other dependencies. In that case, `mapreduce.application.classpath` would be configured to something like the following example, where the archive basename is hadoop-mapreduce-${project.version}.tar.gz and the archive is organized internally like the standard Hadoop distribution archive:
+
+`$HADOOP_CONF_DIR,$PWD/hadoop-mapreduce-${project.version}.tar.gz/hadoop-mapreduce-${project.version}/share/hadoop/mapreduce/*,$PWD/hadoop-mapreduce-${project.version}.tar.gz/hadoop-mapreduce-${project.version}/share/hadoop/mapreduce/lib/*,$PWD/hadoop-mapreduce-${project.version}.tar.gz/hadoop-mapreduce-${project.version}/share/hadoop/common/*,$PWD/hadoop-mapreduce-${project.version}.tar.gz/hadoop-mapreduce-${project.version}/share/hadoop/common/lib/*,$PWD/hadoop-mapreduce-${project.version}.tar.gz/hadoop-mapreduce-${project.version}/share/hadoop/yarn/*,$PWD/hadoop-mapreduce-${project.version}.tar.gz/hadoop-mapreduce-${project.version}/share/hadoop/yarn/lib/*,$PWD/hadoop-mapreduce-${project.version}.tar.gz/hadoop-mapreduce-${project.version}/share/hadoop/hdfs/*,$PWD/hadoop-mapreduce-${project.version}.tar.gz/hadoop-mapreduce-${project.version}/share/hadoop/hdfs/lib/*`
+
+Another possible approach is to have the archive consist of just the MapReduce jars and have the remaining dependencies picked up from the Hadoop distribution installed on the nodes. In that case, the above example would change to something like the following:
+
+`$HADOOP_CONF_DIR,$PWD/hadoop-mapreduce-${project.version}.tar.gz/hadoop-mapreduce-${project.version}/share/hadoop/mapreduce/*,$PWD/hadoop-mapreduce-${project.version}.tar.gz/hadoop-mapreduce-${project.version}/share/hadoop/mapreduce/lib/*,$HADOOP_COMMON_HOME/share/hadoop/common/*,$HADOOP_COMMON_HOME/share/hadoop/common/lib/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*,$HADOOP_YARN_HOME/share/hadoop/yarn/*,$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*`
+
+$H3 NOTE:
+
+If shuffle encryption is also enabled in the cluster, MR jobs may fail with an exception like the one below:
+
+    2014-10-10 02:17:16,600 WARN [fetcher#1] org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to connect to junpingdu-centos5-3.cs1cloud.internal:13562 with 1 map outputs
+    javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
+        at com.sun.net.ssl.internal.ssl.Alerts.getSSLException(Alerts.java:174)
+        at com.sun.net.ssl.internal.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1731)
+        at com.sun.net.ssl.internal.ssl.Handshaker.fatalSE(Handshaker.java:241)
+        at com.sun.net.ssl.internal.ssl.Handshaker.fatalSE(Handshaker.java:235)
+        at com.sun.net.ssl.internal.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1206)
+        at com.sun.net.ssl.internal.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:136)
+        at com.sun.net.ssl.internal.ssl.Handshaker.processLoop(Handshaker.java:593)
+        at com.sun.net.ssl.internal.ssl.Handshaker.process_record(Handshaker.java:529)
+        at com.sun.net.ssl.internal.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:925)
+        at com.sun.net.ssl.internal.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1170)
+        at com.sun.net.ssl.internal.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1197)
+        at com.sun.net.ssl.internal.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1181)
+        at sun.net.www.protocol.https.HttpsClient.afterConnect(HttpsClient.java:434)
+        at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.setNewClient(AbstractDelegateHttpsURLConnection.java:81)
+        at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.setNewClient(AbstractDelegateHttpsURLConnection.java:61)
+        at sun.net.www.protocol.http.HttpURLConnection.writeRequests(HttpURLConnection.java:584)
+        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1193)
+        at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:379)
+        at sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:318)
+        at org.apache.hadoop.mapreduce.task.reduce.Fetcher.verifyConnection(Fetcher.java:427)
+    ....
+
+This is because the MR client (deployed from HDFS) cannot access ssl-client.xml on the local filesystem under the $HADOOP\_CONF\_DIR directory. To fix the problem, add the directory containing ssl-client.xml to the MR classpath specified in `mapreduce.application.classpath`, as mentioned above. To keep the MR application from being affected by other local configuration files, it is better to create a dedicated directory for ssl-client.xml, e.g. a sub-directory under $HADOOP\_CONF\_DIR such as $HADOOP\_CONF\_DIR/security.
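+
+As a rough sketch of that fix, the dedicated directory can be prepended to the classpath used in the second example above. The `$HADOOP_CONF_DIR/security` location is the one suggested in this note. Whether the plain `$HADOOP_CONF_DIR` entry should also remain is a deployment choice, and it is kept here only as an assumption.
+
+```xml
+  <!-- Illustrative only: $HADOOP_CONF_DIR/security is assumed to hold ssl-client.xml. -->
+  <property>
+    <name>mapreduce.application.classpath</name>
+    <value>$HADOOP_CONF_DIR/security,$HADOOP_CONF_DIR,$PWD/hadoop-mapreduce-${project.version}.tar.gz/hadoop-mapreduce-${project.version}/share/hadoop/mapreduce/*,$PWD/hadoop-mapreduce-${project.version}.tar.gz/hadoop-mapreduce-${project.version}/share/hadoop/mapreduce/lib/*,$HADOOP_COMMON_HOME/share/hadoop/common/*,$HADOOP_COMMON_HOME/share/hadoop/common/lib/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*,$HADOOP_YARN_HOME/share/hadoop/yarn/*,$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*</value>
+  </property>
+```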

http://git-wip-us.apache.org/repos/asf/hadoop/blob/e75e6c66/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/EncryptedShuffle.md
----------------------------------------------------------------------
diff --git a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/EncryptedShuffle.md b/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/EncryptedShuffle.md
new file mode 100644
index 0000000..58fd52a
--- /dev/null
+++ b/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/EncryptedShuffle.md
@@ -0,0 +1,255 @@
+<!---
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License. See accompanying LICENSE file.
+-->
+
+Hadoop: Encrypted Shuffle
+=========================
+
+Introduction
+------------
+
+The Encrypted Shuffle capability allows encryption of the MapReduce shuffle using HTTPS, with optional client authentication (also known as bi-directional HTTPS, or HTTPS with client certificates). It comprises:
+
+*   A Hadoop configuration setting for toggling the shuffle between HTTP and
+    HTTPS.
+
+*   Hadoop configuration settings for specifying the keystore and truststore
+    properties (location, type, passwords) used by the shuffle service and the
+    reducer tasks fetching shuffle data.
+
+*   A way to re-load truststores across the cluster (when a node is added or
+    removed).
+
+Configuration
+-------------
+
+### **core-site.xml** Properties
+
+To enable encrypted shuffle, set the following properties in core-site.xml of all nodes in the cluster:
+
+| **Property** | **Default Value** | **Explanation** |
+|:---- |:---- |:---- |
+| `hadoop.ssl.require.client.cert` | `false` | Whether client certificates are required |
+| `hadoop.ssl.hostname.verifier` | `DEFAULT` | The hostname verifier to provide for HttpsURLConnections. Valid values are: **DEFAULT**, **STRICT**, **STRICT\_IE6**, **DEFAULT\_AND\_LOCALHOST** and **ALLOW\_ALL** |
+| `hadoop.ssl.keystores.factory.class` | `org.apache.hadoop.security.ssl.FileBasedKeyStoresFactory` | The KeyStoresFactory implementation to use |
+| `hadoop.ssl.server.conf` | `ssl-server.xml` | Resource file from which SSL server keystore information will be extracted. This file is looked up in the classpath; typically it should be in the Hadoop conf/ directory |
+| `hadoop.ssl.client.conf` | `ssl-client.xml` | Resource file from which SSL client keystore information will be extracted. This file is looked up in the classpath; typically it should be in the Hadoop conf/ directory |
+| `hadoop.ssl.enabled.protocols` | `TLSv1` | The supported SSL protocols (JDK6 can use **TLSv1**, JDK7+ can use **TLSv1,TLSv1.1,TLSv1.2**) |
+
+**IMPORTANT:** Currently, `hadoop.ssl.require.client.cert` should be set to `false`. Refer to the [Client Certificates](#Client_Certificates) section for details.
+
+**IMPORTANT:** All these properties should be marked as final in the cluster configuration files.
+
+#### Example:
+
+```xml
+  <property>
+    <name>hadoop.ssl.require.client.cert</name>
+    <value>false</value>
+    <final>true</final>
+  </property>
+
+  <property>
+    <name>hadoop.ssl.hostname.verifier</name>
+    <value>DEFAULT</value>
+    <final>true</final>
+  </property>
+
+  <property>
+    <name>hadoop.ssl.keystores.factory.class</name>
+    <value>org.apache.hadoop.security.ssl.FileBasedKeyStoresFactory</value>
+    <final>true</final>
+  </property>
+
+  <property>
+    <name>hadoop.ssl.server.conf</name>
+    <value>ssl-server.xml</value>
+    <final>true</final>
+  </property>
+
+  <property>
+    <name>hadoop.ssl.client.conf</name>
+    <value>ssl-client.xml</value>
+    <final>true</final>
+  </property>
+```
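+
+The example above leaves `hadoop.ssl.enabled.protocols` at its default. As a sketch only, assuming every node in the cluster runs JDK7 or later, the additional protocol versions listed in the table could be enabled as follows. The value is taken from the table above as an illustration, not as a recommendation:
+
+```xml
+  <!-- Assumption: all nodes run JDK7+; on JDK6 keep the default, TLSv1 -->
+  <property>
+    <name>hadoop.ssl.enabled.protocols</name>
+    <value>TLSv1,TLSv1.1,TLSv1.2</value>
+    <final>true</final>
+  </property>
+```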
+
+### `mapred-site.xml` Properties
+
+To enable encrypted shuffle, set the following property in `mapred-site.xml` on all nodes in the cluster:
+
+| **Property** | **Default Value** | **Explanation** |
+|:---- |:---- |:---- |
+| `mapreduce.shuffle.ssl.enabled` | `false` | Whether encrypted shuffle is enabled |
+
+**IMPORTANT:** This property should be marked as final in the cluster configuration files.
+
+#### Example:
+
+```xml
+  <property>
+    <name>mapreduce.shuffle.ssl.enabled</name>
+    <value>true</value>
+    <final>true</final>
+  </property>
+```
+
+The Linux container executor should be configured so that job tasks cannot read the server keystore information and gain access to the shuffle server certificates.
+
+Refer to Hadoop Kerberos configuration for details on how to do this.
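+
+As a rough sketch, selecting the Linux container executor in `yarn-site.xml` typically looks like the fragment below. This is only the class selection; a working setup also needs the `container-executor` binary, its configuration file, and the related group settings described in the secure-mode documentation, none of which are shown here:
+
+```xml
+  <!-- Sketch only: selects the executor class; further setup is required -->
+  <property>
+    <name>yarn.nodemanager.container-executor.class</name>
+    <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
+  </property>
+```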
+
+Keystore and Truststore Settings
+--------------------------------
+
+Currently `FileBasedKeyStoresFactory` is the only `KeyStoresFactory` implementation. The `FileBasedKeyStoresFactory` implementation uses the following properties, in the **ssl-server.xml** and **ssl-client.xml** files, to configure the keystores and truststores.
+
+### `ssl-server.xml` (Shuffle server) Configuration:
+
+The mapred user should own the **ssl-server.xml** file and have exclusive read access to it.
+
+| **Property** | **Default Value** | **Explanation** |
+|:---- |:---- |:---- |
+| `ssl.server.keystore.type` | `jks` | Keystore file type |
+| `ssl.server.keystore.location` | NONE | Keystore file location. The mapred user should own this file and have exclusive read access to it. |
+| `ssl.server.keystore.password` | NONE | Keystore file password |
+| `ssl.server.truststore.type` | `jks` | Truststore file type |
+| `ssl.server.truststore.location` | NONE | Truststore file location. The mapred user should own this file and have exclusive read access to it. |
+| `ssl.server.truststore.password` | NONE | Truststore file password |
+| `ssl.server.truststore.reload.interval` | 10000 | Truststore reload interval, in milliseconds |
+
+#### Example:
+
+```xml
+<configuration>
+
+  <!-- Server Certificate Store -->
+  <property>
+    <name>ssl.server.keystore.type</name>
+    <value>jks</value>
+  </property>
+  <property>
+    <name>ssl.server.keystore.location</name>
+    <value>${user.home}/keystores/server-keystore.jks</value>
+  </property>
+  <property>
+    <name>ssl.server.keystore.password</name>
+    <value>serverfoo</value>
+  </property>
+
+  <!-- Server Trust Store -->
+  <property>
+    <name>ssl.server.truststore.type</name>
+    <value>jks</value>
+  </property>
+  <property>
+    <name>ssl.server.truststore.location</name>
+    <value>${user.home}/keystores/truststore.jks</value>
+  </property>
+  <property>
+    <name>ssl.server.truststore.password</name>
+    <value>clientserverbar</value>
+  </property>
+  <property>
+    <name>ssl.server.truststore.reload.interval</name>
+    <value>10000</value>
+  </property>
+</configuration>
+```
+
+### `ssl-client.xml` (Reducer/Fetcher) Configuration:
+
+The mapred user should own the **ssl-client.xml** file and it should have default permissions.
+
+| **Property** | **Default Value** | **Explanation** |
+|:---- |:---- |:---- |
+| `ssl.client.keystore.type` | `jks` | Keystore file type |
+| `ssl.client.keystore.location` | NONE | Keystore file location. The mapred user should own this file and it should have default permissions. |
+| `ssl.client.keystore.password` | NONE | Keystore file password |
+| `ssl.client.truststore.type` | `jks` | Truststore file type |
+| `ssl.client.truststore.location` | NONE | Truststore file location. The mapred user should own this file and it should have default permissions. |
+| `ssl.client.truststore.password` | NONE | Truststore file password |
+| `ssl.client.truststore.reload.interval` | 10000 | Truststore reload interval, in milliseconds |
+
+#### Example:
+
+```xml
+<configuration>
+
+  <!-- Client certificate Store -->
+  <property>
+    <name>ssl.client.keystore.type</name>
+    <value>jks</value>
+  </property>
+  <property>
+    <name>ssl.client.keystore.location</name>
+    <value>${user.home}/keystores/client-keystore.jks</value>
+  </property>
+  <property>
+    <name>ssl.client.keystore.password</name>
+    <value>clientfoo</value>
+  </property>
+
+  <!-- Client Trust Store -->
+  <property>
+    <name>ssl.client.truststore.type</name>
+    <value>jks</value>
+  </property>
+  <property>
+    <name>ssl.client.truststore.location</name>
+    <value>${user.home}/keystores/truststore.jks</value>
+  </property>
+  <property>
+    <name>ssl.client.truststore.password</name>
+    <value>clientserverbar</value>
+  </property>
+  <property>
+    <name>ssl.client.truststore.reload.interval</name>
+    <value>10000</value>
+  </property>
+</configuration>
+```
+
+Activating Encrypted Shuffle
+----------------------------
+
+When you have made the above configuration changes, activate Encrypted Shuffle by restarting all NodeManagers.
+
+**IMPORTANT:** Using encrypted shuffle incurs a significant performance impact. Users should profile this and potentially reserve one or more cores for encrypted shuffle.
+
+Client Certificates
+-------------------
+
+Using client certificates does not fully ensure that the client is a reducer task for the job. Currently, the client certificate keystore files (which contain the private keys) must be readable by all users submitting jobs to the cluster. This means that a rogue job could read those keystore files and use the client certificates in them to establish a secure connection with a Shuffle server. However, unless the rogue job has a proper JobToken, it won't be able to retrieve shuffle data from the Shuffle server. A job, using its own JobToken, can only retrieve shuffle data that belongs to itself.
+
+Reloading Truststores
+---------------------
+
+By default, the truststores reload their configuration every 10 seconds. If a new truststore file is copied over the old one, it will be re-read, and its certificates will replace the old ones. This mechanism is useful for adding or removing nodes from the cluster, or for adding or removing trusted clients. In these cases, the client or NodeManager certificate is added to (or removed from) all the truststore files in the system, and the new configuration will be picked up without you having to restart the NodeManager daemons.
+
+Debugging
+---------
+
+**NOTE:** Enable debugging only for troubleshooting, and then only for jobs running on small amounts of data. It is very verbose and slows down jobs by several orders of magnitude. (You might need to increase `mapred.task.timeout` to prevent jobs from failing because tasks run so slowly.)
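+
+As a sketch of the timeout adjustment mentioned in the note, the fragment below raises `mapred.task.timeout`, which is expressed in milliseconds, for a debugging run. The value of 1800000 (30 minutes) is an arbitrary illustration rather than a recommended setting:
+
+```xml
+  <!-- Illustrative value only: 30 minutes, expressed in milliseconds -->
+  <property>
+    <name>mapred.task.timeout</name>
+    <value>1800000</value>
+  </property>
+```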
+
+To enable SSL debugging in the reducers, set `-Djavax.net.debug=all` in the `mapreduce.reduce.java.opts` property (`mapred.reduce.child.java.opts` is its deprecated name); for example:
+
+      <property>
+        <name>mapreduce.reduce.java.opts</name>
+        <value>-Xmx200m -Djavax.net.debug=all</value>
+      </property>
+
+You can do this on a per-job basis, or by means of a cluster-wide setting in the `mapred-site.xml` file.
+
+To enable SSL debugging in the NodeManager, set the same option in the `yarn-env.sh` file:
+
+      YARN_NODEMANAGER_OPTS="-Djavax.net.debug=all"

