hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ian <liu...@yahoo.com>
Subject Partition performance
Date Thu, 04 Apr 2013 23:01:19 GMT
Hi,
 
I created 3 years of hourly log files (totally 26280 files), and use External Table with partition
to query. I tried two partition methods.
 
1). Log files are stored as /test1/2013/04/02/16/000000_0 (A directory per hour). Use date
and hour as partition keys. Add 3 years of directories to the table partitions. So there are
26280 partitions.
        CREATE EXTERNAL TABLE test1 (logline string) PARTITIONED BY (dt string, hr
int);
        ALTER TABLE test1 ADD PARTITION (dt='2013-04-02', hr=16) LOCATION '/test1/2013/04/02/16';
 
2). Log files are stored as /test2/2013/04/02/16_000000_0 (A directory per day, 24 files in
each directory). Use date as partition key. Add 3 years of directories to the table partitions.
So there are 1095 partitions.        CREATE EXTERNAL TABLE test2 (logline string)
PARTITIONED BY (dt string);
        ALTER TABLE test2 ADD PARTITION (dt='2013-04-02') LOCATION '/test2/2013/04/02';
 
When doing a simple query like 
    SELECT * FROM  test1/test2  WHERE  dt >= '2013-02-01' and dt <= '2013-02-14'
Using approach #1 takes 320 seconds, but #2 only takes 70 seconds. 
 
I'm wondering why there is a big performance difference between these two? These two approaches
have the same number of files, only the directory structure is different. So Hive is going
to load the same amount of files. Why does the number of partitions have such big impact?
Does that mean #2 is a better partition strategy?
 
Thanks.
Mime
View raw message