From jeag...@apache.org
Subject svn commit: r1591107 - in /hadoop/common/trunk/hadoop-mapreduce-project: CHANGES.txt hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/HadoopArchives.md.vm
Date Tue, 29 Apr 2014 21:23:51 GMT
Author: jeagles
Date: Tue Apr 29 21:23:50 2014
New Revision: 1591107

URL: http://svn.apache.org/r1591107
MAPREDUCE-5638. Port Hadoop Archives document to trunk (Akira AJISAKA via jeagles)


@@ -175,6 +175,9 @@ Release 2.5.0 - UNRELEASED
     MAPREDUCE-5812. Make job context available to
     OutputCommitter.isRecoverySupported() (Mohammad Kamrul Islam via jlowe)
+    MAPREDUCE-5638. Port Hadoop Archives document to trunk (Akira AJISAKA via
+    jeagles)

+Hadoop Archives Guide
+ - [Overview](#Overview)
+ - [How to Create an Archive](#How_to_Create_an_Archive)
+ - [How to Look Up Files in Archives](#How_to_Look_Up_Files_in_Archives)
+ - [Archives Examples](#Archives_Examples)
+     - [Creating an Archive](#Creating_an_Archive)
+     - [Looking Up Files](#Looking_Up_Files)
+ - [Hadoop Archives and MapReduce](#Hadoop_Archives_and_MapReduce)
+  Hadoop archives are special format archives. A Hadoop archive maps to a file
+  system directory. A Hadoop archive always has a \*.har extension. A Hadoop
+  archive directory contains metadata (in the form of _index and _masterindex)
+  and data (part-\*) files. The _index file contains the names of the files that
+  are part of the archive and their locations within the part files.
+How to Create an Archive
+  `Usage: hadoop archive -archiveName name -p <parent> <src>* <dest>`
+  -archiveName is the name of the archive you would like to create, for
+  example foo.har. The name should have a \*.har extension. The parent argument
+  specifies the path relative to which the files are archived. For example:
+  `-p /foo/bar a/b/c e/f/g`
+  Here /foo/bar is the parent path and a/b/c, e/f/g are paths relative to the
+  parent. Note that the archives are created by a MapReduce job, so you
+  need a MapReduce cluster to run this. For a detailed example, see the later
+  sections.
+  If you just want to archive a single directory /foo/bar, you can use
+  `hadoop archive -archiveName zoo.har -p /foo/bar /outputdir`
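+  The usage line above can be sketched as a small Python helper that assembles
+  the argument list before handing it to a process launcher. This is only an
+  illustration: the function name build_archive_command is hypothetical, the
+  checks simply mirror the rules stated above (a \*.har name, sources relative
+  to the parent), and /outputdir is reused from the single-directory example
+  as an assumed destination for the a/b/c, e/f/g case.

```python
def build_archive_command(archive_name, parent, sources, dest):
    """Assemble the argv list for `hadoop archive` (hypothetical helper).

    Follows the usage line:
    hadoop archive -archiveName name -p <parent> <src>* <dest>
    """
    if not archive_name.endswith(".har"):
        raise ValueError("archive name should have a .har extension")
    for src in sources:
        # Sources are given relative to the parent path.
        if src.startswith("/"):
            raise ValueError("source must be relative to the parent: " + src)
    return ["hadoop", "archive", "-archiveName", archive_name,
            "-p", parent, *sources, dest]


# Two relative sources under the parent /foo/bar.
print(" ".join(build_archive_command(
    "foo.har", "/foo/bar", ["a/b/c", "e/f/g"], "/outputdir")))

# Single-directory case: no sources, only the parent and the destination.
print(" ".join(build_archive_command("zoo.har", "/foo/bar", [], "/outputdir")))
```

+  The returned list could be passed to something like subprocess.run on a
+  machine where the hadoop command is available; the sketch itself never
+  launches a job.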
+How to Look Up Files in Archives
+  The archive exposes itself as a file system layer, so all the fs shell
+  commands work on archives, but with a different URI. Also note that
+  archives are immutable, so renames, deletes and creates return an error.
+  The URI for Hadoop Archives is
+  `har://scheme-hostname:port/archivepath/fileinarchive`
+  If no scheme is provided, the underlying filesystem is assumed. In that case
+  the URI would look like
+  `har:///archivepath/fileinarchive`
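+  As a rough illustration of the two URI forms, the following Python sketch
+  builds a har URI from its parts. The function har_uri is hypothetical, and
+  the scheme, host name and port in the second call (hdfs, namenode, 8020)
+  are assumed placeholder values, not values taken from this document.

```python
def har_uri(archive_path, file_in_archive="", scheme=None, host=None, port=None):
    """Build a har URI following the two forms described above (hypothetical helper)."""
    if scheme is None:
        # No scheme given: the underlying filesystem is assumed.
        authority = ""
    else:
        if host is None or port is None:
            raise ValueError("host and port are required when a scheme is given")
        # Form: har://scheme-hostname:port/archivepath/fileinarchive
        authority = f"{scheme}-{host}:{port}"
    path = archive_path.rstrip("/")
    if file_in_archive:
        path += "/" + file_in_archive.lstrip("/")
    return f"har://{authority}{path}"


# Default filesystem form.
print(har_uri("/user/zoo/foo.har", "dir1"))

# Explicit underlying filesystem; host and port are placeholder assumptions.
print(har_uri("/user/zoo/foo.har", "dir1", scheme="hdfs", host="namenode", port=8020))
```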
+Archives Examples
+$H3 Creating an Archive
+  `hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo`
+  The above example creates an archive using /user/hadoop as the relative
+  archive directory. The directories /user/hadoop/dir1 and /user/hadoop/dir2
+  will be archived in the following file system directory -- /user/zoo/foo.har.
+  Archiving does not delete the input files. If you want to delete the input
+  files after creating the archive (to reduce namespace), you will have to do
+  it yourself.
+$H3 Looking Up Files
+  Looking up files in hadoop archives is as easy as doing an ls on the
+  filesystem. After you have archived the directories /user/hadoop/dir1 and
+  /user/hadoop/dir2 as in the example above, to see all the files in the
+  archives you can just run:
+  `hdfs dfs -ls -R har:///user/zoo/foo.har/`
+  To understand the significance of the -p argument, let's go through the above
+  example again. If you just do an ls (not ls -R) on the hadoop archive using
+  `hdfs dfs -ls har:///user/zoo/foo.har`
+  The output should be `har:///user/zoo/foo.har/dir1` and
+  `har:///user/zoo/foo.har/dir2`.
+  As you may recall, the archive was created with the following command:
+  `hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo`
+  If we were to change the command to:
+  `hadoop archive -archiveName foo.har -p /user/ hadoop/dir1 hadoop/dir2 /user/zoo`
+  then an ls on the hadoop archive using
+  `hdfs dfs -ls har:///user/zoo/foo.har`
+  would give you `har:///user/zoo/foo.har/hadoop/dir1` and
+  `har:///user/zoo/foo.har/hadoop/dir2`.
+  Notice that the archived files have been archived relative to /user/ rather
+  than /user/hadoop.
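+  The effect of the -p argument can be modelled with a few lines of Python.
+  The sketch below only does path arithmetic under the rules described in
+  this section; it does not call Hadoop, and the function name archive_layout
+  is hypothetical. It maps each original directory to the har URI at which it
+  would appear inside the archive.

```python
import posixpath


def archive_layout(archive_name, parent, sources, dest):
    """Map each original path to its har URI inside the archive (illustrative model)."""
    archive_path = posixpath.join(dest, archive_name)
    layout = {}
    for src in sources:
        if src.startswith("/"):
            raise ValueError("source must be relative to the parent: " + src)
        original = posixpath.join(parent, src)
        # Inside the archive, the path is kept relative to the parent.
        layout[original] = "har://" + posixpath.join(archive_path, src)
    return layout


# Parent /user/hadoop: dir1 and dir2 sit at the top level of the archive.
for original, uri in archive_layout(
        "foo.har", "/user/hadoop", ["dir1", "dir2"], "/user/zoo").items():
    print(original, "->", uri)

# Parent /user/: the same directories now sit under hadoop/ in the archive.
for original, uri in archive_layout(
        "foo.har", "/user/", ["hadoop/dir1", "hadoop/dir2"], "/user/zoo").items():
    print(original, "->", uri)
```

+  Both calls archive the same two directories; only the paths under which
+  they appear inside foo.har differ.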
+Hadoop Archives and MapReduce
+  Using Hadoop Archives in MapReduce is as easy as specifying a different input
+  filesystem than the default file system. If you have a hadoop archive stored
+  in HDFS in /user/zoo/foo.har, then to use this archive as MapReduce input,
+  all you need to do is specify the input directory as har:///user/zoo/foo.har.
+  Since Hadoop Archives are exposed as a file system, MapReduce will be able to
+  use all the logical input files in Hadoop Archives as input.

