From: jeagles@apache.org
To: mapreduce-commits@hadoop.apache.org
Subject: svn commit: r1591107 - in /hadoop/common/trunk/hadoop-mapreduce-project: CHANGES.txt hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/HadoopArchives.md.vm
Date: Tue, 29 Apr 2014 21:23:51 -0000
Message-Id: <20140429212351.302072388860@eris.apache.org>

Author: jeagles
Date: Tue Apr 29 21:23:50 2014
New Revision: 1591107

URL: http://svn.apache.org/r1591107
Log:
MAPREDUCE-5638. Port Hadoop Archives document to trunk (Akira AJISAKA via jeagles)

Added:
    hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/HadoopArchives.md.vm
Modified:
    hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt

Modified: hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
URL: http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt?rev=1591107&r1=1591106&r2=1591107&view=diff
==============================================================================
--- hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt (original)
+++ hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt Tue Apr 29 21:23:50 2014
@@ -175,6 +175,9 @@ Release 2.5.0 - UNRELEASED
     MAPREDUCE-5812. Make job context available to
     OutputCommitter.isRecoverySupported() (Mohammad Kamrul Islam via jlowe)

+    MAPREDUCE-5638. Port Hadoop Archives document to trunk (Akira AJISAKA via
+    jeagles)
+
   OPTIMIZATIONS

   BUG FIXES

Added: hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/HadoopArchives.md.vm
URL: http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/HadoopArchives.md.vm?rev=1591107&view=auto
==============================================================================
--- hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/HadoopArchives.md.vm (added)
+++ hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/HadoopArchives.md.vm Tue Apr 29 21:23:50 2014
@@ -0,0 +1,138 @@

#set ( $H3 = '###' )

Hadoop Archives Guide
=====================

  - [Overview](#Overview)
  - [How to Create an Archive](#How_to_Create_an_Archive)
  - [How to Look Up Files in Archives](#How_to_Look_Up_Files_in_Archives)
  - [Archives Examples](#Archives_Examples)
      - [Creating an Archive](#Creating_an_Archive)
      - [Looking Up Files](#Looking_Up_Files)
  - [Hadoop Archives and MapReduce](#Hadoop_Archives_and_MapReduce)

Overview
--------

  Hadoop archives are special format archives. A Hadoop archive maps to a file
  system directory. A Hadoop archive always has a \*.har extension. A Hadoop
  archive directory contains metadata (in the form of _index and _masterindex)
  and data (part-\*) files. The _index file contains the names of the files
  that are part of the archive and their locations within the part files.

How to Create an Archive
------------------------

  `Usage: hadoop archive -archiveName name -p <parent> <src>* <dest>`

  -archiveName is the name of the archive you would like to create, for
  example foo.har. The name should have a \*.har extension. The parent
  argument specifies the path relative to which the files are archived.
  For example:

  `-p /foo/bar a/b/c e/f/g`

  Here /foo/bar is the parent path, and a/b/c and e/f/g are paths relative to
  the parent. Note that it is a MapReduce job that creates the archive, so you
  need a MapReduce cluster to run the command. See the later sections for a
  detailed example.

  If you just want to archive a single directory /foo/bar, you can simply use

  `hadoop archive -archiveName zoo.har -p /foo/bar /outputdir`
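  As a quick end-to-end sketch (the directory names below are hypothetical),
  creating an archive and then inspecting the resulting \*.har directory might
  look like this:

```
# Archive the HDFS directory /user/hadoop/logs into /user/hadoop/archives/logs.har
hadoop archive -archiveName logs.har -p /user/hadoop logs /user/hadoop/archives

# The archive is itself a directory holding the metadata and data files
# described in the Overview: _index, _masterindex and part-* files.
hdfs dfs -ls /user/hadoop/archives/logs.har
```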
How to Look Up Files in Archives
--------------------------------

  The archive exposes itself as a file system layer, so all the fs shell
  commands work on archives, just with a different URI. Also note that
  archives are immutable, so renames, deletes and creates return an error.
  The URI for Hadoop Archives is

  `har://scheme-hostname:port/archivepath/fileinarchive`

  If no scheme is provided, the underlying filesystem is assumed. In that
  case the URI would look like

  `har:///archivepath/fileinarchive`
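  For instance, taking the foo.har archive created in the examples below and
  assuming a hypothetical HDFS NameNode on host namenode, port 8020, and a
  hypothetical archived file dir1/a.txt, the same file can be read through
  either form of the URI:

```
# Underlying scheme and host spelled out explicitly
hdfs dfs -cat har://hdfs-namenode:8020/user/zoo/foo.har/dir1/a.txt

# Same file, with the default filesystem assumed
hdfs dfs -cat har:///user/zoo/foo.har/dir1/a.txt
```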
Archives Examples
-----------------

$H3 Creating an Archive

  `hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo`

  The above example creates an archive using /user/hadoop as the relative
  archive directory. The directories /user/hadoop/dir1 and /user/hadoop/dir2
  will be archived in the following file system directory -- /user/zoo/foo.har.
  Archiving does not delete the input files. If you want to delete the input
  files after creating the archives (to reduce namespace), you will have to
  do it on your own.

$H3 Looking Up Files

  Looking up files in hadoop archives is as easy as doing an ls on the
  filesystem. After you have archived the directories /user/hadoop/dir1 and
  /user/hadoop/dir2 as in the example above, you can see all the files in the
  archives by running:

  `hdfs dfs -ls -R har:///user/zoo/foo.har/`

  To understand the significance of the -p argument, let's go through the
  above example again. If you just do a plain ls (not the recursive -ls -R)
  on the hadoop archive using

  `hdfs dfs -ls har:///user/zoo/foo.har`

  the output should be:

```
har:///user/zoo/foo.har/dir1
har:///user/zoo/foo.har/dir2
```

  As you recall, the archive was created with the following command:

  `hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo`

  If we were to change the command to:

  `hadoop archive -archiveName foo.har -p /user/ hadoop/dir1 hadoop/dir2 /user/zoo`

  then an ls on the hadoop archive using

  `hdfs dfs -ls har:///user/zoo/foo.har`

  would give you

```
har:///user/zoo/foo.har/hadoop/dir1
har:///user/zoo/foo.har/hadoop/dir2
```

  Notice that the files have now been archived relative to /user/ rather
  than /user/hadoop.

Hadoop Archives and MapReduce
-----------------------------

  Using Hadoop Archives in MapReduce is as easy as specifying a different
  input filesystem than the default file system. If you have a hadoop archive
  stored in HDFS at /user/zoo/foo.har, then to use this archive as MapReduce
  input, all you need to do is specify the input directory as
  har:///user/zoo/foo.har. Since a Hadoop Archive is exposed as a file system,
  MapReduce is able to use all the logical files in the archive as input.
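  As a sketch, running the stock wordcount job from the MapReduce examples jar
  over the archived dir1 might look like the following (the examples jar path
  and the output directory are assumptions that must match your installation):

```
# Read input through the har scheme; write plain HDFS output.
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
  wordcount har:///user/zoo/foo.har/dir1 /user/zoo/wordcount-output
```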