From: szetszwo@apache.org
To: mapreduce-commits@hadoop.apache.org
Subject: svn commit: r930088 - in /hadoop/mapreduce/trunk: CHANGES.txt src/docs/src/documentation/content/xdocs/hadoop_archives.xml
Date: Thu, 01 Apr 2010 20:46:07 -0000
Message-Id: <20100401204607.25AA523888EC@eris.apache.org>

Author: szetszwo
Date: Thu Apr  1 20:46:06 2010
New Revision: 930088

URL: http://svn.apache.org/viewvc?rev=930088&view=rev
Log:
MAPREDUCE-1514. Add documentation on replication, permissions, new options,
limitations and internals of har. Contributed by mahadev

Modified:
    hadoop/mapreduce/trunk/CHANGES.txt
    hadoop/mapreduce/trunk/src/docs/src/documentation/content/xdocs/hadoop_archives.xml

Modified: hadoop/mapreduce/trunk/CHANGES.txt
URL: http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/CHANGES.txt?rev=930088&r1=930087&r2=930088&view=diff
==============================================================================
--- hadoop/mapreduce/trunk/CHANGES.txt (original)
+++ hadoop/mapreduce/trunk/CHANGES.txt Thu Apr  1 20:46:06 2010
@@ -235,6 +235,9 @@ Trunk (unreleased changes)
     MAPREDUCE-1489. DataDrivenDBInputFormat should not query the database
     when generating only one split. (Aaron Kimball via tomwhite)
 
+    MAPREDUCE-1514. Add documentation on replication, permissions, new options,
+    limitations and internals of har. (mahadev via szetszwo)
+
   OPTIMIZATIONS
 
     MAPREDUCE-270. Fix the tasktracker to optionally send an out-of-band

Modified: hadoop/mapreduce/trunk/src/docs/src/documentation/content/xdocs/hadoop_archives.xml
URL: http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/src/docs/src/documentation/content/xdocs/hadoop_archives.xml?rev=930088&r1=930087&r2=930088&view=diff
==============================================================================
--- hadoop/mapreduce/trunk/src/docs/src/documentation/content/xdocs/hadoop_archives.xml (original)
+++ hadoop/mapreduce/trunk/src/docs/src/documentation/content/xdocs/hadoop_archives.xml Thu Apr  1 20:46:06 2010
Hadoop Archive Guide

What are Hadoop archives?

Hadoop archives are special format archives. The main use case for archives is to reduce the namespace of the NameNode: a Hadoop archive collapses a set of files into a smaller number of files and provides an efficient and easy interface for accessing the collapsed files. A Hadoop archive maps to an HDFS directory and always has a *.har extension.
How to create an archive?

Usage: hadoop archive -archiveName name -p <parent> <src>* <dest>

-archiveName is the name of the archive you would like to create, for example foo.har; the name must have a *.har extension. The -p argument specifies the parent path relative to which the source files are archived. An example would be:

-p /foo/bar a/b/c e/f/g

Here /foo/bar is the parent path, and a/b/c and e/f/g are paths relative to that parent. Note that it is a Map/Reduce job that creates the archive, so you need a Map/Reduce cluster to run the command. See the later sections for a detailed example.

If you just want to archive a single directory /foo/bar, you can simply use

hadoop archive -archiveName zoo.har -p /foo/bar /outputdir
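Putting the pieces together, a sketch of the multi-source form (the archive name combined.har and the output directory /outputdir are illustrative, not from this guide):

hadoop archive -archiveName combined.har -p /foo/bar a/b/c e/f/g /outputdir

This archives /foo/bar/a/b/c and /foo/bar/e/f/g, storing them relative to the parent /foo/bar.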
How to look up files in archives?

The archive exposes itself as a file system layer, so all the fs shell commands work on archives, but with a different URI. Also note that archives are immutable, so renames, deletes and creates return an error. The URI for Hadoop archives is

har://scheme-hostname:port/archivepath/fileinarchive

If no scheme is provided, the underlying filesystem is assumed. In that case the URI would look like

har:///archivepath/fileinarchive
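For instance, with HDFS as the underlying filesystem and a NameNode at nn.example.com:8020 (hostname and port assumed for illustration), a fully qualified URI for the foo.har archive used below would follow the scheme-hostname pattern above:

har://hdfs-nn.example.com:8020/user/zoo/foo.har/dir1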
Example on creating and looking up archives

hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo

The above example creates an archive using /user/hadoop as the relative archive directory. The directories /user/hadoop/dir1 and /user/hadoop/dir2 will be archived in the file system directory /user/zoo/foo.har. Archiving does not delete the input files; if you want to delete the input files after creating an archive (to reduce namespace), you will have to do it on your own.
Looking up files and understanding the -p option

Looking up files in Hadoop archives is as easy as doing an ls on the filesystem. After you have archived the directories /user/hadoop/dir1 and /user/hadoop/dir2 as in the example above, you can see all the files in the archive by running:

hadoop dfs -lsr har:///user/zoo/foo.har/

To understand the significance of the -p argument, let's go through the above example again. If you just do an ls (not lsr) on the hadoop archive using

hadoop dfs -ls har:///user/zoo/foo.har

the output should be:

har:///user/zoo/foo.har/dir1
har:///user/zoo/foo.har/dir2

As you can recall, the archive was created with the following command:

hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo

If we were to change the command to:

hadoop archive -archiveName foo.har -p /user/ hadoop/dir1 hadoop/dir2 /user/zoo

then an ls on the hadoop archive using

hadoop dfs -ls har:///user/zoo/foo.har

would give you

har:///user/zoo/foo.har/hadoop/dir1
har:///user/zoo/foo.har/hadoop/dir2

Notice that the archived files have been archived relative to /user/ rather than /user/hadoop.
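Individual archived files can be read the same way through the har scheme; for example, assuming a hypothetical file dir1/hello.txt was among the archived files:

hadoop dfs -cat har:///user/zoo/foo.har/dir1/hello.txt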
Using Hadoop Archive with Map Reduce

Using a Hadoop archive in Map Reduce is as easy as specifying a different input filesystem than the default file system. If you have a Hadoop archive stored in HDFS at /user/zoo/foo.har, then all you need to do to use this archive as Map Reduce input is to specify the input directory as har:///user/zoo/foo.har. Since a Hadoop archive is exposed as a file system, Map Reduce is able to use all the logical input files in the archive as input.
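As an illustrative sketch, here is the stock wordcount example run over one directory of the archive (the examples jar name and the output path are assumptions, not from this guide):

hadoop jar hadoop-*-examples.jar wordcount har:///user/zoo/foo.har/dir1 /user/zoo/wc-output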
File Replication and Permissions of Hadoop Archive

A Hadoop archive currently does not store the file metadata that the files had before they were archived. The files of a newly created Hadoop archive get the default permissions with which the user creates files. The replication of data files in a Hadoop archive is set to 3, and the metadata (index) files have a replication factor of 5. You can increase these by increasing the replication factor of the files under the har directory, as sketched at the end of this section. On restoration of hadoop archive files using something like:

hadoop distcp har:///path_to_har dest_path

the restored files will not have the permissions or replication of the original files that were archived.
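A minimal sketch of raising replication on an existing archive (using the foo.har path from the earlier example; the target factor of 5 is illustrative):

hadoop dfs -setrep -R 5 /user/zoo/foo.har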
Creating a Hadoop Archive with different block size and part size

You can specify the HDFS block size and the part file size for an archive using the following options:

bin/hadoop archive -Dhar.block.size=512 -Dhar.partfile.size=1024 -archiveName ...

The above example sets a block size of 512 bytes for the part files and a part file size of 1 KB. These numbers are examples only; using such low values for the block size and part file size is not advisable at all!
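A fuller sketch with more realistic values (the sizes are in bytes and, like the archive name and paths, are assumptions for illustration):

bin/hadoop archive -Dhar.block.size=536870912 -Dhar.partfile.size=4294967296 -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo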
Limitations of Hadoop Archive

Currently Hadoop archives do not support input paths that contain spaces; such a path causes an exception to be thrown. You can, however, create archives in which spaces in file names are replaced by a valid character. Below is an example:

bin/hadoop archive -Dhar.space.replacement.enable=true -Dhar.space.replacement="_" -archiveName ...

The above example replaces the spaces in archived file names with "_".
Internals of Hadoop Archive

A Hadoop archive directory contains metadata (in the form of _index and _masterindex) and data (part-*) files. The _index file contains the names of the files that are part of the archive and their locations within the part files. The _masterindex file stores offsets into the _index file, which makes seeking within the _index file faster during lookups.
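You can see these internal files by listing the archive directory through the underlying filesystem rather than through the har scheme (path from the earlier example):

hadoop dfs -ls /user/zoo/foo.har

The listing shows the _index, _masterindex, and part-* files instead of the archived logical files.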