hive-user mailing list archives

From Saurabh Nanda
Subject Re: Re: bz2 Splits.
Date Sat, 25 Jul 2009 10:46:17 GMT
> If you want to load data (in compressed/uncompressed text format) into
> a table, you have to define the table as "stored as textfile" instead
> of "stored as sequencefile".

I tried both approaches.

Approach #1:
a) gunzip log file
b) import into textfile table
c) set hive.exec.compress.output to true
d) insert into sequencefile table
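
Roughly, steps b) to d) would look like the HiveQL below. This is only a
sketch: the table names, the single 'line' column, the 'dt' partition key,
the local file path and the use of LOCAL are placeholders and assumptions,
not the actual setup.

```sql
-- Hypothetical tables; the schema and partition key are assumptions.
CREATE TABLE logs_text (line STRING)
PARTITIONED BY (dt STRING) STORED AS TEXTFILE;
CREATE TABLE logs_seq (line STRING)
PARTITIONED BY (dt STRING) STORED AS SEQUENCEFILE;

-- b) load the gunzipped log file into the textfile table
LOAD DATA LOCAL INPATH '/tmp/access.log'
INTO TABLE logs_text PARTITION (dt='2009-07-24');

-- c) ask Hive to compress the output of the insert that follows
SET hive.exec.compress.output=true;

-- d) rewrite the rows into the sequencefile table
INSERT OVERWRITE TABLE logs_seq PARTITION (dt='2009-07-24')
SELECT line FROM logs_text WHERE dt='2009-07-24';
```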

This seems to have produced 125 files named 'attempt_*' in the partition's
directory, all under 10MB. (How do I find out the total size of a directory?
I need to see how much space the compression saved.)

Approach #2: import the gzip log files into a textfile table
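
As a sketch, the load for this approach would be something like the
following. Again, the table name, schema, partition key and file path are
placeholders, and LOCAL is an assumption about where the files were.

```sql
-- Hypothetical textfile table; the .gz file is loaded without gunzipping it first.
CREATE TABLE logs_gz (line STRING)
PARTITIONED BY (dt STRING) STORED AS TEXTFILE;

-- Repeated once per gzip file (8 files in this case).
LOAD DATA LOCAL INPATH '/tmp/access-1.log.gz'
INTO TABLE logs_gz PARTITION (dt='2009-07-24');
```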

The files seem to have been copied as-is into the partition's directory. But
every query is always split into 8 maps (which is the number of files I
imported). This, I guess, won't help me much because I would be
underutilizing the map capacity I have.

Here's something interesting. I ran a SELECT COUNT(1) on all three
tables and got different results and wildly different response times.

Gunzipped files imported into textfile table: 8,259,720 (108 sec)
Sequencefile table populated by step d of approach #1: 8,316,946 (114 sec)
Gzip files imported into textfile table: 8,619,980 (50 sec)

Why would a simple row count differ? And, surprisingly, fewer maps resulted
in better performance!

