hadoop-mapreduce-commits mailing list archives

From cdoug...@apache.org
Subject svn commit: r909785 - in /hadoop/mapreduce/trunk: CHANGES.txt src/docs/src/documentation/content/xdocs/hadoop_archives.xml
Date Sat, 13 Feb 2010 10:36:01 GMT
Author: cdouglas
Date: Sat Feb 13 10:36:00 2010
New Revision: 909785

URL: http://svn.apache.org/viewvc?rev=909785&view=rev
MAPREDUCE-1474. Update forrest documentation for Hadoop Archives. Contributed by Mahadev Konar


Modified: hadoop/mapreduce/trunk/CHANGES.txt
URL: http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/CHANGES.txt?rev=909785&r1=909784&r2=909785&view=diff
--- hadoop/mapreduce/trunk/CHANGES.txt (original)
+++ hadoop/mapreduce/trunk/CHANGES.txt Sat Feb 13 10:36:00 2010
@@ -329,6 +329,9 @@
     MAPREDUCE-1305. Improve efficiency of distcp -delete. (Peter Romianowski
     via cdouglas)
+    MAPREDUCE-1474. Update forrest documentation for Hadoop Archives. (Mahadev
+    Konar via cdouglas)
 Release 0.21.0 - Unreleased

Modified: hadoop/mapreduce/trunk/src/docs/src/documentation/content/xdocs/hadoop_archives.xml
URL: http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/src/docs/src/documentation/content/xdocs/hadoop_archives.xml?rev=909785&r1=909784&r2=909785&view=diff
--- hadoop/mapreduce/trunk/src/docs/src/documentation/content/xdocs/hadoop_archives.xml (original)
+++ hadoop/mapreduce/trunk/src/docs/src/documentation/content/xdocs/hadoop_archives.xml Sat Feb 13 10:36:00 2010
@@ -18,11 +18,11 @@
 <!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" "http://forrest.apache.org/dtd/document-v20.dtd">
-        <title>Hadoop Archives Guide</title>
+        <title>Archives Guide</title>
-        <title>Overview</title>
+        <title> What are Hadoop archives? </title>
         Hadoop archives are special format archives. A Hadoop archive
         maps to a file system directory. A Hadoop archive always has a *.har
@@ -34,49 +34,84 @@
-        <title> How to Create an Archive </title>
+        <title> How to create an archive? </title>
-        <code>Usage: hadoop archive -archiveName name &lt;src&gt;* &lt;dest&gt;</code>
+        <code>Usage: hadoop archive -archiveName name -p &lt;parent&gt; &lt;src&gt;* &lt;dest&gt;</code>
         -archiveName is the name of the archive you would like to create. 
         An example would be foo.har. The name should have a *.har extension. 
-        The inputs are file system pathnames which work as usual with regular
-        expressions. The destination directory would contain the archive.
-        Note that this is a MapReduce job that creates the archives. You would
-        need a MapReduce cluster to run this. The following is an example:</p>
-        <p>
-        <code>hadoop archive -archiveName foo.har /user/hadoop/dir1 /user/hadoop/dir2
-        </p><p>
-        In the above example /user/hadoop/dir1 and /user/hadoop/dir2 will be
-        archived in the following file system directory -- /user/zoo/foo.har.
-        The sources are not changed or removed when an archive is created.
-        </p>
+        The parent argument specifies the relative path to which the files should be
+        archived. For example:
+        </p><p><code> -p /foo/bar a/b/c e/f/g </code></p><p>
+        Here /foo/bar is the parent path and a/b/c, e/f/g are paths relative to the parent.
+        Note that it is a Map/Reduce job that creates the archives, so you need a
+        Map/Reduce cluster to run it. See the later sections for a detailed example.
+        <p> If you just want to archive a single directory /foo/bar then you can just use </p>
+        <p><code> hadoop archive -archiveName zoo.har -p /foo/bar /outputdir </code></p>
-        <title> How to Look Up Files in Archives </title>
+        <title> How to look up files in archives? </title>
         The archive exposes itself as a file system layer. So all the fs shell
         commands in the archives work but with a different URI. Also, note that
-        archives are immutable. So, rename, delete and create will return
-        an error. The URI for Hadoop Archives is:
+        archives are immutable. So renames, deletes and creates return
+        an error. The URI for Hadoop Archives is
         If no scheme is provided it assumes the underlying filesystem. 
-        In that case the URI would look like this:
-        </p><p><code>
-        har:///archivepath/fileinarchive</code></p>
-        <p>
-        Here is an example of archive. The input to the archives is /dir. The directory dir
-        files filea, fileb. To archive /dir to /user/hadoop/foo.har, the command is: 
-        </p>
-        <p><code>hadoop archive -archiveName foo.har /dir /user/hadoop</code>
-        </p><p>
-        To get file listing for files in the created archive: 
+        In that case the URI would look like </p>
+        <p><code>har:///archivepath/fileinarchive</code></p>
+        </section>
+ 		<section>
+ 		<title> Example on creating and looking up archives </title>
+        <p><code>hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo </code></p>
+        <p>
+         The above example creates an archive using /user/hadoop as the relative archive
+         directory. The directories /user/hadoop/dir1 and /user/hadoop/dir2 will be
+        archived in the following file system directory -- /user/zoo/foo.har. Archiving
+        does not delete the input files. If you want to delete the input files after
+        creating the archives (to reduce namespace), you will have to do it on your own.
-        <p><code>hadoop dfs -lsr har:///user/hadoop/foo.har</code></p>
-        <p>To cat filea in archive:
-        </p><p><code>hadoop dfs -cat har:///user/hadoop/foo.har/dir/filea</code></p>
+        <section>
+        <title> Looking up files and understanding the -p option </title>
+		 <p> Looking up files in hadoop archives is as easy as doing an ls on the filesystem.
+		 After you have archived the directories /user/hadoop/dir1 and /user/hadoop/dir2 as in
+		 the example above, to see all the files in the archives you can just run: </p>
+		 <p><code>hadoop dfs -lsr har:///user/zoo/foo.har/</code></p>
+		 <p> To understand the significance of the -p argument, let's go through the above
+		 example again. If you just do an ls (not lsr) on the hadoop archive using </p>
+		 <p><code>hadoop dfs -ls har:///user/zoo/foo.har</code></p>
+		 <p>The output should be:</p>
+		 <source>
+		 </source>
+		 <p> As you may recall, the archives were created with the following command: </p>
+        <p><code>hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo </code></p>
+        <p> If we were to change the command to: </p>
+        <p><code>hadoop archive -archiveName foo.har -p /user/ hadoop/dir1 hadoop/dir2 /user/zoo </code></p>
+        <p> then an ls on the hadoop archive using </p>
+        <p><code>hadoop dfs -ls har:///user/zoo/foo.har</code></p>
+        <p>would give you</p>
+        <source>
+		</source>
+		<p>
+		Notice that the archived files have been archived relative to /user/ rather than /user/hadoop.
+		</p>
+		</section>
+		</section>
+		<section>
+		<title> Using Hadoop Archives with Map Reduce </title> 
+		<p>Using Hadoop Archives in Map Reduce is as easy as specifying a different input
+		filesystem than the default file system. If you have a hadoop archive stored in HDFS
+		in /user/zoo/foo.har, then to use this archive for Map Reduce input, all you need to
+		do is specify the input directory as har:///user/zoo/foo.har. Since Hadoop Archives
+		are exposed as a file system, Map Reduce will be able to use all the logical input
+		files in Hadoop Archives as input.</p>
-	</body>
+  </body>
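The patch documents the URI form for files inside an archive: har:///archivepath/fileinarchive when no scheme is given, in which case the underlying filesystem is assumed. As a rough illustration of how such a URI is composed, here is a small sketch; `har_uri` is a hypothetical helper written for this note, not part of Hadoop's API.

```python
def har_uri(archive_path: str, file_in_archive: str = "") -> str:
    """Build a scheme-less har URI: har:///archivepath/fileinarchive.

    With no explicit scheme, the underlying (default) filesystem is
    assumed, per the documentation above. Hypothetical helper for
    illustration only.
    """
    # archive_path is absolute ("/user/zoo/foo.har"), so prefixing
    # "har://" yields the triple-slash form "har:///user/zoo/foo.har".
    uri = "har://" + archive_path.rstrip("/")
    if file_in_archive:
        uri += "/" + file_in_archive.lstrip("/")
    return uri

# e.g. the archive /user/zoo/foo.har and a file dir1/filea inside it:
print(har_uri("/user/zoo/foo.har", "dir1/filea"))
```

This mirrors the shell usage in the patch, e.g. `hadoop dfs -lsr har:///user/zoo/foo.har/`.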

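The -p discussion above can be restated as: each source is archived under the path it has relative to the parent, so a non-recursive ls of the archive shows only the first component of each relative source path. The sketch below models that semantics in plain string handling; `top_level_ls` is a hypothetical illustration of the documented behavior, not a Hadoop call.

```python
def top_level_ls(archive: str, relative_sources: list[str]) -> list[str]:
    """Model what a non-recursive ls of har:///<archive> would show,
    assuming each source is stored relative to the -p parent path.
    Hypothetical helper illustrating the doc's -p example."""
    top = set()
    for src in relative_sources:
        # only the first component of the relative path appears
        # at the top level of the archive listing
        top.add(src.strip("/").split("/")[0])
    return sorted("har://" + archive.rstrip("/") + "/" + t for t in top)

# -p /user/hadoop dir1 dir2 -> top-level entries dir1 and dir2
print(top_level_ls("/user/zoo/foo.har", ["dir1", "dir2"]))
# -p /user/ hadoop/dir1 hadoop/dir2 -> a single top-level entry, hadoop
print(top_level_ls("/user/zoo/foo.har", ["hadoop/dir1", "hadoop/dir2"]))
```

This reproduces the patch's point that the second command archives the files relative to /user/ rather than /user/hadoop.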