hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hive/LanguageManual/Archiving" by PaulYang
Date Tue, 02 Nov 2010 21:06:44 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hive/LanguageManual/Archiving" page has been changed by PaulYang.
http://wiki.apache.org/hadoop/Hive/LanguageManual/Archiving?action=diff&rev1=2&rev2=3

--------------------------------------------------

  
  == Overview ==
  
- Due to the design of HDFS, the number of files on HDFS directly affect the memory consumption
in the namenode. While normally not a problem for small clusters, memory usage may hit the
limits of accessible memory on a single machine when there are >50-100 million files. Consequently,
it is advantageous to have as few files as possible.
+ Due to the design of HDFS, the number of files in the filesystem directly affects the memory
consumption in the namenode. While normally not a problem for small clusters, memory usage
may hit the limits of accessible memory on a single machine when there are >50-100 million
files. Consequently, it is advantageous to have as few files as possible.
  
  The use of [[http://hadoop.apache.org/mapreduce/docs/r0.21.0/hadoop_archives.html | Hadoop
Archives]] is one approach to reducing the number of files in a partition. Hive has built-in
support that allows users to easily move files in existing partitions to a Hadoop Archive
(HAR) file, so that a partition that may once have consisted of hundreds of files occupies ~3 files
(depending on settings). However, the trade-off is that queries may be slower due to the additional
overhead of indirection.
  
@@ -26, +26 @@

  
  {{{hive.archive.enabled}}} controls whether archiving operations are enabled.
  
- {{{hive.archive.har.parentdir.settable}}} informs Hive whether the parent directory is set-able
while creating the archive. In the latest version of Hadoop the {{{-p}}} option could be set
to specify the root directory of the archive. For example, if {{{/dir1/dir2/file}} were archived
with {{{/dir1}}} as the parent directory, then the resulting archive file will contain the
directory structure {{{dir2/file}}}. In older versions of Hadoop, this option was not available
and therefore Hive must be configured to accommodate this limitation. 
+ {{{hive.archive.har.parentdir.settable}}} informs Hive whether the parent directory is set-able
while creating the archive. In the latest version of Hadoop the {{{-p}}} option could be set
to specify the root directory of the archive. For example, if {{{/dir1/dir2/file}}} were archived
with {{{/dir1}}} as the parent directory, then the resulting archive file will contain the
directory structure {{{dir2/file}}}. In older versions of Hadoop, this option was not available
and therefore Hive must be configured to accommodate this limitation. 
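
The effect of the parent directory on the paths stored in an archive can be illustrated with a small sketch. It only reproduces the path arithmetic from the example above; the function name {{{path_inside_archive}}} is hypothetical and is not part of Hive or Hadoop.

```python
import posixpath


def path_inside_archive(file_path: str, parent_dir: str) -> str:
    """Return the path a file would have relative to the chosen parent directory.

    Hypothetical helper for illustration only: it mirrors the example in the
    text, where /dir1/dir2/file archived with parent /dir1 becomes dir2/file.
    """
    rel = posixpath.relpath(file_path, parent_dir)
    # A result that climbs out of the parent means the file is not under it.
    if rel == ".." or rel.startswith("../"):
        raise ValueError(f"{file_path} is not under {parent_dir}")
    return rel


if __name__ == "__main__":
    print(path_inside_archive("/dir1/dir2/file", "/dir1"))
```

HDFS paths use forward slashes, which is why the sketch uses {{{posixpath}}} rather than the platform-dependent {{{os.path}}}.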
  
  {{{har.partfile.size}}} controls the size of the files that make up the archive. The archive
will contain {{{[Size of partition]/har.partfile.size}}} files, rounded up. Higher values
mean fewer files, but will result in longer archiving times due to the reduced number of mappers.
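
As a rough sketch of the sizing rule, the snippet below estimates the number of part files, assuming the count is the partition size divided by {{{har.partfile.size}}}, rounded up (the reading under which higher values mean fewer files). The function name {{{estimated_part_files}}} is hypothetical, both sizes are assumed to be in bytes, and the index files a HAR also contains are not counted.

```python
def estimated_part_files(partition_bytes: int, partfile_bytes: int) -> int:
    """Estimate the number of part files: ceil(partition size / har.partfile.size).

    Hypothetical helper for illustration; sizes are assumed to be in bytes.
    """
    if partfile_bytes <= 0:
        raise ValueError("partfile_bytes must be positive")
    if partition_bytes < 0:
        raise ValueError("partition_bytes must not be negative")
    # Integer ceiling division avoids floating-point rounding on large byte counts.
    return -(-partition_bytes // partfile_bytes)


if __name__ == "__main__":
    gib = 2 ** 30
    # A 10 GiB partition with 1 GiB part files.
    print(estimated_part_files(10 * gib, gib))
    # One extra byte spills into an additional part file.
    print(estimated_part_files(10 * gib + 1, gib))
```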
  
@@ -44, +44 @@

  ALTER TABLE srcpart ARCHIVE PARTITION(ds='2008-04-08', hr='12')
  }}}
  
- Once the command is issued, a mapreduce job will be launched that performs the archiving.
Note that there is no output on the CLI to indicate process.
+ Once the command is issued, a mapreduce job will be launched to perform the archiving. Unlike running
Hive queries, there is no output on the CLI to indicate progress.
  
  === Unarchive ===
  
@@ -56, +56 @@

  
  == Cautions and Limitations ==
  
-  * In some older versions of Hadoop, HAR had a few bugs that could cause data loss / corruption.
Be sure that these patches are integrated into your version of Hadoop:
+  * In some older versions of Hadoop, HAR had a few bugs that could cause data loss or other
errors. Be sure that these patches are integrated into your version of Hadoop:
  
  [[https://issues.apache.org/jira/browse/MAPREDUCE-1548]]
  
@@ -72, +72 @@

  
  Hive comes with the HiveHarFileSystem class that addresses some of these issues, and is
by default the value for {{{fs.har.impl}}}. Keep this in mind if you're rolling your own version
of HarFileSystem. 
  
-  * The default HiveHarFileSystem.getFileBlockLocations() has '''no locality''. That means
it may introduce higher network loads or reduced performance.
+  * The default HiveHarFileSystem.getFileBlockLocations() has '''no locality'''. That means
it may introduce higher network loads or reduced performance.
  
   * Archived partitions cannot be overwritten with INSERT OVERWRITE ... The partition must
be unarchived first.
   
-  * If two processes attempt to archive the same partition at the same time, bad things can
happen. (Need to implement concurrency support..)
+  * If two processes attempt to archive the same partition at the same time, bad things could
happen. (Need to implement concurrency support..)
  
  == Under the hood ==
  
- Internally, when a partition is archived, t
+ Internally, when a partition is archived, a HAR is created using the files from the partition's
original location (e.g. {{{/warehouse/table/ds=1}}}). The parent directory of the partition
is specified to be the same as the original location and the resulting archive is named 'data.har'.
The archive is moved under the original directory (e.g. {{{/warehouse/table/ds=1/data.har}}})
and the partition's location is changed to point to the archive.
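
The path bookkeeping described above can be sketched as follows. This covers only where the archive ends up relative to the partition's original directory; the function name {{{archived_location}}} is hypothetical, and the exact URI form that Hive records for the new partition location is not shown here.

```python
import posixpath


def archived_location(partition_location: str, archive_name: str = "data.har") -> str:
    """Return where the archive sits after being moved under the original directory.

    Hypothetical helper for illustration; 'data.har' is the archive name
    given in the text.
    """
    return posixpath.join(partition_location, archive_name)


if __name__ == "__main__":
    print(archived_location("/warehouse/table/ds=1"))
```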
  
