hive-user mailing list archives

From Brent Miller <brentalanmil...@gmail.com>
Subject Re: Help with Compressed Storage
Date Tue, 16 Feb 2010 23:51:18 GMT
Thank you for the responses and I'm terribly sorry if I'm missing something
obvious here, but after going through google searches a second time and
reviewing your feedback, I'm still having issues with compressed storage not
seeming to work correctly.

The commands that I've been entering into the hive cli are:

SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
SET hive.exec.compress.output=TRUE;
SET io.seqfile.compression.type=BLOCK;

CREATE TABLE test1_comp_gz (busId TINYINT, uId BIGINT, dStamp STRING, tStamp
STRING, canId STRING, dlc TINYINT, hexData STRING) PARTITIONED BY (bus
TINYINT, day STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES
TERMINATED BY '\n' STORED AS SEQUENCEFILE;

INSERT OVERWRITE TABLE test1_comp_gz PARTITION (bus=0, day='2010-02-01')
SELECT busid, uid, dstamp, tstamp, canid, dlc, hexdata FROM test1 WHERE
bus=0 AND day='2010-02-01';

Is there something wrong here? I have the hive.exec.compress.output=true
line and I had tried adding a hive.exec.compress.intermediate=TRUE at one
point in time thinking it may have had something to do with HIVE-794 but
that seemed to have no effect.

Thanks again,
Brent

On Tue, Feb 16, 2010 at 2:32 PM, Yongqiang He <heyongqiang@software.ict.ac.cn> wrote:

> Like Zheng said,
> Try set hive.exec.compress.output=true;
> "set hive.exec.compress.intermediate=true" is not recommended because of the
> CPU cost.
>
> Also, in some cases, set hive.merge.mapfiles = false; will help get
> better compression.
>
>
> On 2/16/10 2:04 PM, "Zheng Shao" <zshao9@gmail.com> wrote:
>
> > Try google "Hive compression":
> >
> > See
> >
> > http://svn.apache.org/viewvc/hadoop/hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java?p2=/hadoop/hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java&p1=/hadoop/hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java&r1=723687&r2=723686&view=diff&pathrev=723687
> >
> >     COMPRESSRESULT("hive.exec.compress.output", false),
> >     COMPRESSINTERMEDIATE("hive.exec.compress.intermediate", false),
> >
> > Hive uses different compression parameters than hadoop.
> >
> > Also, Hive supports using different compression for intermediate
> > results. See https://issues.apache.org/jira/browse/HIVE-759
> >
> >
> > Zheng
> >
> > On Tue, Feb 16, 2010 at 1:43 PM, Brent Miller <brentalanmiller@gmail.com> wrote:
> >> Hello, I've seen issues similar to this one come up once or twice before,
> >> but I haven't ever seen a solution to the problem that I'm having. I was
> >> following the Compressed Storage page on the Hive
> >> Wiki http://wiki.apache.org/hadoop/CompressedStorage and realized that the
> >> sequence files that are created in the warehouse directory are actually
> >> uncompressed and larger than the originals.
> >> For example, I have a table 'test1' whose input data looks something like:
> >> 0,1369962224,2010/02/01,00:00:00.101,0C030301,4,0000BD43
> >> 0,1369962225,2010/02/01,00:00:00.101,0C030501,4,66268E43
> >> 0,1369962226,2010/02/01,00:00:00.101,0C030701,4,041F3341
> >> ...
> >> And after creating a second table 'test1_comp' that was created with the
> >> STORED AS SEQUENCEFILE directive and the compression options SET as
> >> described in the wiki, I can look at the resultant sequence files and see
> >> that they're just plain (uncompressed) text:
> >> SEQ "org.apache.hadoop.io.BytesWritable
> org.apache.hadoop.io.Text+�c�!Y�M ��
> >> Z^��= 80,1369962224,2010/02/01,00:00:00.101,0C030301,4,0000BD43=
> >> 80,1369962225,2010/02/01,00:00:00.101,0C030501,4,66268E43=
> >> 80,1369962226,2010/02/01,00:00:00.101,0C030701,4,041F3341=
> >> 80,1369962227,2010/02/01,00:00:00.101,0C030901,4,11360141=
> >> ...
> >> I've tried messing around with different org.apache.hadoop.io.compress.*
> >> options, but the sequence files always come out uncompressed. Has anybody
> >> ever seen this, or does anybody know a way to keep the data compressed?
> >> Since the input text is so uniform, we get huge space savings from
> >> compression and would like to store the data this way if possible. I'm
> >> using Hadoop 20.1 and Hive that I checked out from SVN about a week ago.
> >> Thanks,
> >> Brent
> >
> >
>
>
>
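
The original message checks for compression by reading the sequence file and eyeballing the bytes. The same check can be made less ambiguous by parsing the file header, which records two compression flags and, when compression is on, the codec class name right after the value class name. The following is only a rough sketch of that idea in Python. The function names are made up for illustration, the header layout is assumed to be the one written by SequenceFile versions 5 and 6, and string lengths are assumed to fit in a single byte (true for ordinary class names).

```python
import io
import struct


def read_hadoop_string(stream):
    """Read a length-prefixed UTF-8 string (single-byte length only)."""
    first = stream.read(1)
    if len(first) != 1:
        raise ValueError("truncated header")
    (length,) = struct.unpack("b", first)
    if length < 0:
        # Assumption of this sketch: multi-byte lengths are not handled.
        raise ValueError("multi-byte string length is not supported here")
    data = stream.read(length)
    if len(data) != length:
        raise ValueError("truncated header")
    return data.decode("utf-8")


def read_flag(stream):
    """Read a one-byte boolean (zero means false, anything else true)."""
    raw = stream.read(1)
    if len(raw) != 1:
        raise ValueError("truncated header")
    return raw != b"\x00"


def describe_sequence_file(stream):
    """Return the compression-related fields of a SequenceFile header."""
    magic = stream.read(4)
    if len(magic) != 4 or magic[:3] != b"SEQ":
        raise ValueError("not a SequenceFile")
    version = magic[3]
    if version < 5:
        raise ValueError("header versions below 5 are not handled here")
    key_class = read_hadoop_string(stream)
    value_class = read_hadoop_string(stream)
    compressed = read_flag(stream)
    block_compressed = read_flag(stream)
    # The codec class name is only present when the compressed flag is set.
    codec = read_hadoop_string(stream) if compressed else None
    return {
        "version": version,
        "key_class": key_class,
        "value_class": value_class,
        "compressed": compressed,
        "block_compressed": block_compressed,
        "codec": codec,
    }


if __name__ == "__main__":
    def encode(name):
        raw = name.encode("utf-8")
        return bytes([len(raw)]) + raw

    # In-memory header resembling the uncompressed file quoted in the thread.
    sample = (b"SEQ\x06"
              + encode("org.apache.hadoop.io.BytesWritable")
              + encode("org.apache.hadoop.io.Text")
              + b"\x00\x00")
    print(describe_sequence_file(io.BytesIO(sample)))
```

To try it on a real file, copy one output file from the table's partition directory to local disk and pass it opened in binary mode, for example describe_sequence_file(open("000000_0", "rb")), where 000000_0 is a placeholder file name. If "compressed" comes back false, the file was written without compression regardless of what the SET commands said.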
