hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Matt Olson <maolso...@gmail.com>
Subject Aggregated table larger than expected
Date Thu, 23 Jun 2016 21:37:34 GMT
Hi,

I am working with an hourly table and a daily table in Hive 1.0.1. Both
tables have the same schema except that the hourly table is partitioned by
dt and hour, but the daily table is partitioned only by dt. At the end of
each day, the records from the hourly table are aggregated into the daily
table, but we do a group by to remove some duplicate records.

There are 1,363,106,822 total rows in the hourly table on 2016-06-20, and
1,300,287,508 rows in the daily table since some are dropped in the group
by. However, the total size of all files in the hourly table for that date
is 135.9 GB, but the total size of files in the daily table is 158.8 GB.
I'm wondering why the daily table would be significantly larger, since it
has fewer records but all the same information for the records it does
have. Both tables are stored as RCFile and
use com.hadoop.compression.lzo.LzoCodec for compression. However, the
hourly table contains about 66000 small files, whereas the daily one
contains 494.

If I remove the group by and max functions from the query and just insert
all the records for 2016-06-20 from the hourly table into the daily table
(so the job becomes map-only), the daily table turns out to be smaller than
the hourly. It seems like the introduction of the group by/reduce phase is
causing the output to be larger somehow. I have also tried storing the
daily table as ORC rather than RCFile, since ORC is more space-efficient.
The total size is 147.2 GB, so smaller than the RCFile version but still
larger than the hourly table.

I've attached the query inserting into the daily table, as well as the
explain output of the query.

Thanks for any help,
Matt

Mime
View raw message