hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hive/LanguageManual/Archiving" by PaulYang
Date Tue, 02 Nov 2010 01:25:39 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hive/LanguageManual/Archiving" page has been changed by PaulYang.
http://wiki.apache.org/hadoop/Hive/LanguageManual/Archiving

--------------------------------------------------

New page:
= Archiving for File Count Reduction =

Note: Archiving should be considered an advanced command due to the caveats involved.

<<TableOfContents>>

== Overview ==

Due to the design of HDFS, the number of files in HDFS directly affects the memory consumption
of the namenode. While this is normally not a problem for small clusters, memory usage may hit
the limits of accessible memory on a single machine when there are more than 50-100 million files.
Consequently, it is advantageous to have as few files as possible.

The use of [[http://hadoop.apache.org/mapreduce/docs/r0.21.0/hadoop_archives.html | Hadoop
Archives]] is one approach to reducing the number of files in a partition. Hive has built-in
support that allows users to easily move the files in existing partitions into a Hadoop Archive
(HAR) file, so that a partition that may once have consisted of hundreds of files occupies only
~3 files (depending on settings). However, the trade-off is that queries may be slower due to
the additional overhead of indirection.

Note that archiving does NOT compress the files - HAR is analogous to the Unix tar command.
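
Since a HAR behaves like a read-only directory tree, its contents can still be listed and read
through the {{{har://}}} filesystem scheme. A minimal illustration (the warehouse path and
archive name here are hypothetical):

{{{
hadoop fs -ls har:///user/hive/warehouse/srcpart/ds=2008-04-08/hr=12/data.har
}}}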

== Settings ==

There are three settings that should be configured before archiving is used (example values
are shown):

{{{
hive> set hive.archive.enabled=true;
hive> set hive.archive.har.parentdir.settable=true;
hive> set har.partfile.size=1099511627776;
}}}

{{{hive.archive.enabled}}} controls whether archiving operations are enabled.

{{{hive.archive.har.parentdir.settable}}} informs Hive whether the parent directory can be set
while creating the archive. In recent versions of Hadoop, the {{{-p}}} option can be used to
specify the root directory of the archive. For example, if {{{/dir1/dir2/file}}} were archived
with {{{/dir1}}} as the parent directory, then the resulting archive file would contain the
directory structure {{{dir2/file}}}. In older versions of Hadoop, this option was not available,
and therefore Hive must be configured to accommodate this limitation.
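
For reference, the same behavior is exposed by the {{{-p}}} flag of the standalone
{{{hadoop archive}}} tool; a rough equivalent of the example above (the output path is
illustrative) would be:

{{{
hadoop archive -archiveName data.har -p /dir1 dir2 /outputdir
}}}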

{{{har.partfile.size}}} controls the size of the files that make up the archive. The archive
will contain {{{[size of partition] / har.partfile.size}}} files, rounded up. Higher values
mean fewer files, but will result in longer archiving times due to the reduced number of mappers.
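
As a worked example using the value shown above, archiving a hypothetical 2.5 TB partition with
{{{har.partfile.size=1099511627776}}} (1 TB) would produce ceil(2.5 TB / 1 TB) = 3 part files
inside the archive.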

== Usage ==

=== Archive ===
Once the configuration values are set, a partition can be archived with the command:

{{{
ALTER TABLE table_name ARCHIVE PARTITION (partition_col = partition_col_value, partition_col
= partition_col_value, ...)
}}}

e.g. 
{{{
ALTER TABLE srcpart ARCHIVE PARTITION(ds='2008-04-08', hr='12')
}}}

Once the command is issued, a MapReduce job will be launched to perform the archiving.
Note that there is no output on the CLI to indicate progress.
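
To confirm that the archive was created, the partition's location in HDFS can be listed once
the job finishes. The warehouse path below is illustrative and depends on your configuration:

{{{
hadoop fs -ls /user/hive/warehouse/srcpart/ds=2008-04-08/hr=12
}}}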

=== Unarchive ===

The partition can be reverted back to its original files with the unarchive command:

{{{
ALTER TABLE srcpart UNARCHIVE PARTITION(ds='2008-04-08', hr='12')
}}}

== Cautions and Limitations ==

 * In some older versions of Hadoop, HAR had a few bugs that could cause data loss / corruption.
Be sure that these patches are integrated into your version of Hadoop:

[[https://issues.apache.org/jira/browse/MAPREDUCE-1548]]

[[https://issues.apache.org/jira/browse/HADOOP-6591]]

[[https://issues.apache.org/jira/browse/MAPREDUCE-2143]]

 * The HarFileSystem class still has a few bugs that have yet to be fixed:

[[https://issues.apache.org/jira/browse/MAPREDUCE-1752]]

[[https://issues.apache.org/jira/browse/MAPREDUCE-1877]]

Hive comes with the HiveHarFileSystem class, which addresses some of these issues and is the
default value for {{{fs.har.impl}}}. Keep this in mind if you're rolling your own version of
HarFileSystem.
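
As a sketch, assuming the shims package name used by typical Hive releases of this period,
the default setting corresponds to:

{{{
hive> set fs.har.impl=org.apache.hadoop.hive.shims.HiveHarFileSystem;
}}}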

 * The default HiveHarFileSystem.getFileBlockLocations() has NO LOCALITY. That means it may
result in higher network loads or reduced performance.

 * Archived partitions cannot be overwritten with INSERT OVERWRITE ... The partition must
be unarchived first.
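
As a sketch, overwriting an archived partition therefore takes three steps; the source table
and columns here are illustrative:

{{{
ALTER TABLE srcpart UNARCHIVE PARTITION(ds='2008-04-08', hr='12');
INSERT OVERWRITE TABLE srcpart PARTITION(ds='2008-04-08', hr='12')
  SELECT key, value FROM srcpart_staging;
ALTER TABLE srcpart ARCHIVE PARTITION(ds='2008-04-08', hr='12');
}}}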

== Under the hood ==
