hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "CompressedStorage" by SaurabhNanda
Date Mon, 27 Jul 2009 10:48:04 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The following page has been changed by SaurabhNanda:
http://wiki.apache.org/hadoop/CompressedStorage

The comment on the change is:
first version of the page

New page:
== Compressed Data Storage ==
Keeping data compressed in Hive tables has, in some cases, been known to give better performance
than uncompressed storage, both in terms of disk usage and query performance.

You can import text files compressed with Gzip or Bzip2 directly into a table stored as TextFile.
The compression will be detected automatically and the file will be decompressed on-the-fly
during query execution. For example:

{{{
CREATE TABLE raw (line STRING)
   ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';

LOAD DATA LOCAL INPATH '/tmp/weblogs/20090603-access.log.gz' INTO TABLE raw;
}}}

The table 'raw' is stored as a TextFile, the default storage format. However, because gzip is
not a splittable compression format, Hadoop will not be able to split the file into chunks/blocks
and run multiple maps in parallel. This can leave your cluster's 'mapping' capacity under-utilized.
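
As a quick illustration (a hypothetical session; the exact counter output depends on your
Hadoop version), a full scan of the gzip-backed table runs as a single map task no matter
how large the file is:

{{{
-- Hypothetical check: because the gzip codec is not splittable, this
-- full scan is executed by a single map task regardless of file size.
SELECT COUNT(1) FROM raw;
}}}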

The recommended practice is to insert the data into another table, which is stored as a SequenceFile.
A SequenceFile can be split by Hadoop and distributed across map tasks. For example:

{{{
CREATE TABLE raw (line STRING)
   ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';

CREATE TABLE raw_sequence (line STRING)
   ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
   STORED AS SEQUENCEFILE;

LOAD DATA LOCAL INPATH '/tmp/weblogs/20090603-access.log.gz' INTO TABLE raw;

SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK; -- NONE/RECORD/BLOCK (see below)
INSERT OVERWRITE TABLE raw_sequence SELECT line FROM raw;
}}}

The value for io.seqfile.compression.type determines how the compression is performed. RECORD
compresses each record's value individually, while BLOCK buffers many records and compresses
them together. BLOCK compression generally achieves a better compression ratio, because the
codec operates on larger chunks of data, and block-compressed SequenceFiles remain splittable,
so BLOCK is usually the setting to prefer. Note that the number of output files is determined
by the number of map/reduce tasks writing them, not by this setting.
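
If you also want to control which codec is used for the output, it can be set explicitly
before the insert. This is a minimal sketch assuming the stock Hadoop GzipCodec and the
mapred.output.compression.codec property; substitute whichever codecs are available in your
installation:

{{{
-- A sketch assuming the standard Hadoop GzipCodec is on the classpath;
-- any installed CompressionCodec implementation can be named here instead.
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
INSERT OVERWRITE TABLE raw_sequence SELECT line FROM raw;
}}}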
