hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hive/HBaseBulkLoad" by JohnSichi
Date Sat, 17 Apr 2010 00:39:31 GMT
The "Hive/HBaseBulkLoad" page has been changed by JohnSichi.


  The CREATE TABLE statement creates a dummy table which controls how the output of the sort is written.  Note that it uses {{{HiveHFileOutputFormat}}} to do this, with the table property {{{hfile.family.path}}} controlling the destination directory for the output.  Again, be sure to set the inputformat/outputformat exactly as specified.  In the example above, we select gzip (gz) compression for the result files; if you don't set the {{{hfile.compression}}} parameter, no compression will be performed.  (The other available method is lzo, which compresses less aggressively but requires less CPU power.)
+ The parameter {{{hbase.hregion.max.filesize}}} (default 256MB) affects how HFiles are generated.  If the amount of data (pre-compression) produced by a reducer exceeds this limit, more than one HFile will be generated for that reducer, which leads to unbalanced region files.  This does not cause any correctness problems, but if you want balanced region files, either use more reducers or set this parameter to a larger value.  Note that when compression is enabled, you may see multiple files generated whose sizes are well below the limit; this is because the overflow check is done pre-compression.
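
The effect of the pre-compression overflow check can be illustrated with a little arithmetic. The following is only a rough sketch, not the actual HFile writer logic: it assumes a new file starts exactly when the limit is reached (the real writer may roll over at slightly different points), and the function names and the 4:1 compression ratio are hypothetical values chosen for illustration.

```python
MB = 1024 * 1024
# Default stated above for hbase.hregion.max.filesize.
DEFAULT_MAX_FILESIZE = 256 * MB


def estimate_hfile_sizes(uncompressed_bytes, max_filesize=DEFAULT_MAX_FILESIZE):
    """Estimate the pre-compression size of each HFile one reducer writes.

    Simplifying assumption: a new file is started exactly when the
    pre-compression limit is reached.
    """
    if uncompressed_bytes <= 0:
        return []
    full_files, remainder = divmod(uncompressed_bytes, max_filesize)
    sizes = [max_filesize] * full_files
    if remainder:
        sizes.append(remainder)
    return sizes


def on_disk_sizes(sizes, compression_ratio):
    """Scale pre-compression sizes by an assumed (hypothetical) compression ratio."""
    return [int(size * compression_ratio) for size in sizes]


# A reducer that produces 600 MB before compression.
sizes = estimate_hfile_sizes(600 * MB)
print(len(sizes), "HFiles for this reducer")
print([s // MB for s in on_disk_sizes(sizes, 0.25)], "MB on disk (assumed 4:1 ratio)")
```

Under these assumptions the reducer yields three files of roughly 64, 64, and 22 MB on disk, all well below the 256MB limit, which matches the behavior described above. Raising the limit to 1024 MB in the same sketch would give a single file for that reducer.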
  The {{{cf}}} in the path specifies the name of the column family which will be created in HBase, so the directory name you choose here is important.  (Note that we're not actually using an HBase table here; {{{HiveHFileOutputFormat}}} writes directly to files.)
  The CLUSTER BY clause provides the keys to be used by the partitioner; be sure that it matches the range partitioning that you came up with in the earlier step.