From: szetszwo@apache.org
To: mapreduce-commits@hadoop.apache.org
Subject: svn commit: r930088 - in /hadoop/mapreduce/trunk: CHANGES.txt src/docs/src/documentation/content/xdocs/hadoop_archives.xml
Date: Thu, 01 Apr 2010 20:46:07 -0000
Message-Id: <20100401204607.25AA523888EC@eris.apache.org>

Author: szetszwo
Date: Thu Apr  1 20:46:06 2010
New Revision: 930088

URL: http://svn.apache.org/viewvc?rev=930088&view=rev
Log:
MAPREDUCE-1514. Add documentation on replication, permissions, new options,
limitations and internals of har. Contributed by mahadev

Modified:
    hadoop/mapreduce/trunk/CHANGES.txt
    hadoop/mapreduce/trunk/src/docs/src/documentation/content/xdocs/hadoop_archives.xml

Modified: hadoop/mapreduce/trunk/CHANGES.txt
URL: http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/CHANGES.txt?rev=930088&r1=930087&r2=930088&view=diff
==============================================================================
--- hadoop/mapreduce/trunk/CHANGES.txt (original)
+++ hadoop/mapreduce/trunk/CHANGES.txt Thu Apr  1 20:46:06 2010
@@ -235,6 +235,9 @@ Trunk (unreleased changes)
     MAPREDUCE-1489. DataDrivenDBInputFormat should not query the database
     when generating only one split. (Aaron Kimball via tomwhite)
 
+    MAPREDUCE-1514. Add documentation on replication, permissions, new options,
+    limitations and internals of har. (mahadev via szetszwo)
+
   OPTIMIZATIONS
 
     MAPREDUCE-270. Fix the tasktracker to optionally send an out-of-band

Modified: hadoop/mapreduce/trunk/src/docs/src/documentation/content/xdocs/hadoop_archives.xml
URL: http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/src/docs/src/documentation/content/xdocs/hadoop_archives.xml?rev=930088&r1=930087&r2=930088&view=diff
==============================================================================
--- hadoop/mapreduce/trunk/src/docs/src/documentation/content/xdocs/hadoop_archives.xml (original)
+++ hadoop/mapreduce/trunk/src/docs/src/documentation/content/xdocs/hadoop_archives.xml Thu Apr  1 20:46:06 2010
Hadoop Archive Guide

What are Hadoop archives?

Hadoop archives are special format archives. The main use case for archives is to reduce the namespace of the NameNode: a Hadoop archive collapses a set of files into a smaller number of files and provides an efficient and easy interface for accessing the collapsed files. A Hadoop archive maps to an HDFS directory and always has a *.har extension.
How to create an archive?

Usage: hadoop archive -archiveName name -p <parent> <src>* <dest>

-archiveName is the name of the archive you would like to create, for example foo.har; the name must have a *.har extension. The -p argument specifies the parent path relative to which the source files are archived. An example would be:

-p /foo/bar a/b/c e/f/g

Here /foo/bar is the parent path, and a/b/c and e/f/g are paths relative to that parent. Note that it is a Map/Reduce job that creates the archive, so you need a Map/Reduce cluster to run the command. See the later sections for a detailed example.

If you just want to archive a single directory /foo/bar, you can simply use

hadoop archive -archiveName zoo.har -p /foo/bar /outputdir
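Putting the pieces together, a sketch of the multi-source form (the archive name combined.har and the output directory /outputdir are illustrative, not from this guide):

hadoop archive -archiveName combined.har -p /foo/bar a/b/c e/f/g /outputdir

This archives /foo/bar/a/b/c and /foo/bar/e/f/g, storing them relative to the parent /foo/bar.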
How to look up files in archives?

The archive exposes itself as a file system layer, so all the fs shell commands work on archives, but with a different URI. Also note that archives are immutable, so renames, deletes and creates return an error. The URI for Hadoop archives is

har://scheme-hostname:port/archivepath/fileinarchive

If no scheme is provided, the underlying filesystem is assumed. In that case the URI would look like

har:///archivepath/fileinarchive
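For instance, with HDFS as the underlying filesystem and a NameNode at nn.example.com:8020 (hostname and port assumed for illustration), a fully qualified URI for the foo.har archive used below would follow the scheme-hostname pattern above:

har://hdfs-nn.example.com:8020/user/zoo/foo.har/dir1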
Example on creating and looking up archives

hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo

The above example creates an archive using /user/hadoop as the relative archive directory. The directories /user/hadoop/dir1 and /user/hadoop/dir2 will be archived in the file system directory /user/zoo/foo.har. Archiving does not delete the input files; if you want to delete the input files after creating an archive (to reduce namespace), you will have to do it on your own.
Looking up files and understanding the -p option

Looking up files in Hadoop archives is as easy as doing an ls on the filesystem. After you have archived the directories /user/hadoop/dir1 and /user/hadoop/dir2 as in the example above, you can see all the files in the archive by running:

hadoop dfs -lsr har:///user/zoo/foo.har/

To understand the significance of the -p argument, let's go through the above example again. If you just do an ls (not lsr) on the hadoop archive using

hadoop dfs -ls har:///user/zoo/foo.har

the output should be:

har:///user/zoo/foo.har/dir1
har:///user/zoo/foo.har/dir2

As you can recall, the archive was created with the following command:

hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo

If we were to change the command to:

hadoop archive -archiveName foo.har -p /user/ hadoop/dir1 hadoop/dir2 /user/zoo

then an ls on the hadoop archive using

hadoop dfs -ls har:///user/zoo/foo.har

would give you

har:///user/zoo/foo.har/hadoop/dir1
har:///user/zoo/foo.har/hadoop/dir2

Notice that the archived files have been archived relative to /user/ rather than /user/hadoop.
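Individual archived files can be read the same way through the har scheme; for example, assuming a hypothetical file dir1/hello.txt was among the archived files:

hadoop dfs -cat har:///user/zoo/foo.har/dir1/hello.txt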
Using Hadoop Archive with Map Reduce

Using a Hadoop archive in Map Reduce is as easy as specifying a different input filesystem than the default file system. If you have a Hadoop archive stored in HDFS at /user/zoo/foo.har, then all you need to do to use this archive as Map Reduce input is to specify the input directory as har:///user/zoo/foo.har. Since a Hadoop archive is exposed as a file system, Map Reduce is able to use all the logical input files in the archive as input.
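As an illustrative sketch, here is the stock wordcount example run over one directory of the archive (the examples jar name and the output path are assumptions, not from this guide):

hadoop jar hadoop-*-examples.jar wordcount har:///user/zoo/foo.har/dir1 /user/zoo/wc-output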
File Replication and Permissions of Hadoop Archive

A Hadoop archive currently does not store the file metadata that the files had before they were archived. The files of a newly created Hadoop archive get the default permissions with which the user creates files. The replication of data files in a Hadoop archive is set to 3, and the metadata (index) files have a replication factor of 5. You can increase these by increasing the replication factor of the files under the har directory, as sketched at the end of this section. On restoration of hadoop archive files using something like:

hadoop distcp har:///path_to_har dest_path

the restored files will not have the permissions or replication of the original files that were archived.
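A minimal sketch of raising replication on an existing archive (using the foo.har path from the earlier example; the target factor of 5 is illustrative):

hadoop dfs -setrep -R 5 /user/zoo/foo.har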
Creating a Hadoop Archive with different block size and part size

You can specify the HDFS block size and the part file size for an archive using the following options:

bin/hadoop archive -Dhar.block.size=512 -Dhar.partfile.size=1024 -archiveName ...

The above example sets a block size of 512 bytes for the part files and a part file size of 1 KB. These numbers are examples only; using such low values for the block size and part file size is not advisable at all!
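A fuller sketch with more realistic values (the sizes are in bytes and, like the archive name and paths, are assumptions for illustration):

bin/hadoop archive -Dhar.block.size=536870912 -Dhar.partfile.size=4294967296 -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo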
Limitations of Hadoop Archive

Currently Hadoop archives do not support input paths that contain spaces; such a path causes an exception to be thrown. You can, however, create archives in which spaces in file names are replaced by a valid character. Below is an example:

bin/hadoop archive -Dhar.space.replacement.enable=true -Dhar.space.replacement="_" -archiveName ...

The above example replaces the spaces in archived file names with "_".
Internals of Hadoop Archive

A Hadoop archive directory contains metadata (in the form of _index and _masterindex) and data (part-*) files. The _index file contains the names of the files that are part of the archive and their locations within the part files. The _masterindex file stores offsets into the _index file, which makes seeking within the _index file faster during lookups.
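You can see these internal files by listing the archive directory through the underlying filesystem rather than through the har scheme (path from the earlier example):

hadoop dfs -ls /user/zoo/foo.har

The listing shows the _index, _masterindex, and part-* files instead of the archived logical files.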