hive-user mailing list archives

From Ravi Shetye <ravishe...@gmail.com>
Subject Performance comparison external s3 table vs managed table
Date Wed, 29 Aug 2012 10:54:16 GMT
I am launching a Hive cluster in interactive mode on Amazon Elastic MapReduce
(http://aws.amazon.com/elasticmapreduce/faqs/#hive-6).

I have data on S3 laid out like this:

s3://ravi/logs/adv_id=123/date=2012-01-01/log.gz

s3://ravi/logs/adv_id=456/date=2012-01-02/log.gz

s3://ravi/logs/adv_id=123/date=2012-01-03/log.gz

I create two tables

CREATE EXTERNAL TABLE s3Table (...)
PARTITIONED BY (adv_id STRING, dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION 's3://ravi/logs/';

CREATE TABLE managedTable (...)   ==> same column definition
PARTITIONED BY (adv_id STRING, dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

I load data into both tables
ALTER TABLE s3Table RECOVER PARTITIONS;
and
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
INSERT OVERWRITE TABLE managedTable PARTITION (adv_id,dt) SELECT * FROM
s3Table;

Intuitively, I expect managedTable to perform better.

I run a count(*) query on both tables, each of which contains approximately
40,000,000 rows.
The query on s3Table launches one mapper per partition and finishes in 149 sec.
The query on managedTable launches one mapper per HDFS block and finishes in
238 sec.
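
For reference, the comparison above amounts to running something like the following against each table. The exact statements were not included in this message, so the query text and the SHOW PARTITIONS sanity check are an assumed form, not a copy of what was run:

```sql
-- Assumed form of the two timed queries; table names come from the DDL above.
-- Reported: one mapper per partition, 149 sec.
SELECT COUNT(*) FROM s3Table;

-- Reported: one mapper per HDFS block, 238 sec.
SELECT COUNT(*) FROM managedTable;

-- Optional sanity check (an addition, not part of the original test):
-- both tables should list the same adv_id/dt partitions.
SHOW PARTITIONS s3Table;
SHOW PARTITIONS managedTable;
```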

Can I improve the performance of managedTable with any tuning parameters?
Should I avoid using managed tables altogether?
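
As a starting point for the tuning question, here is a minimal sketch of settings that may be worth experimenting with. It rests on two assumptions that are not verified in this message: that the INSERT OVERWRITE wrote uncompressed text (the source files are .gz, so the managed copy may simply be larger on disk), and that part of the extra time comes from launching many small map tasks. The split size value is illustrative, not a recommendation:

```sql
-- Assumption: compress the INSERT output so the managed copy is not
-- larger on disk than the gzip source files.
set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

-- Rewrite the managed table with compression enabled.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE managedTable PARTITION (adv_id, dt)
SELECT * FROM s3Table;

-- Assumption: fewer, larger splits reduce per-mapper startup overhead.
-- 256000000 bytes is an arbitrary example value.
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set mapred.max.split.size=256000000;

-- Re-run the timed query to see whether the gap changes.
SELECT COUNT(*) FROM managedTable;
```

Gzip files are not splittable, so with compressed output each file is read by a single mapper, which is closer to how the s3Table query behaves. Whether either change closes the gap would need to be measured on the same cluster.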

I ran the experiment on an m1.large cluster to rule out disk I/O vs. network
differences as the explanation.
-- 
RAVI SHETYE
