Return-Path: X-Original-To: apmail-hive-user-archive@www.apache.org Delivered-To: apmail-hive-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id DEE4B9D42 for ; Wed, 29 Aug 2012 10:54:47 +0000 (UTC) Received: (qmail 11392 invoked by uid 500); 29 Aug 2012 10:54:46 -0000 Delivered-To: apmail-hive-user-archive@hive.apache.org Received: (qmail 11232 invoked by uid 500); 29 Aug 2012 10:54:45 -0000 Mailing-List: contact user-help@hive.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@hive.apache.org Delivered-To: mailing list user@hive.apache.org Received: (qmail 11213 invoked by uid 99); 29 Aug 2012 10:54:44 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 29 Aug 2012 10:54:44 +0000 X-ASF-Spam-Status: No, hits=1.5 required=5.0 tests=FSL_RCVD_USER,HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of ravishetye@gmail.com designates 209.85.216.48 as permitted sender) Received: from [209.85.216.48] (HELO mail-qa0-f48.google.com) (209.85.216.48) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 29 Aug 2012 10:54:37 +0000 Received: by qady1 with SMTP id y1so343220qad.14 for ; Wed, 29 Aug 2012 03:54:16 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:date:message-id:subject:from:to:content-type; bh=LXjOZfdlYS+KeWqOrrpyP9Alh1hXYEdBQoOZQzTM8n8=; b=Uff78EtMTCvmfERIgthwSOVP8ynhDaMooUlbOzOFuSOw4ND6fD0zY537RYQEEVHEf2 aW5C1aT6h/EQLqs82yXjJbRYXiFVvTH8ufb3oIFiIB7QTOURPACpJnrja9XnGc5NEhjh mImiksXYd7wtoERBwkcEkqkgyWzR9F2R8kWkWIQilHpdPeFYaG+887X+Dgdw334z/938 4MOXyNL8SDACmRXZrYYFj2L5f8Opysveiht8VvB9JXMhAxXI08nKDP9a54Dy8SHbL2Uz X65qChroDPQbzK0DmPV/gimXsPqxiiUFz14xz4mHE6sluSSesxXKwfi9RpaczIznuzYV /1OA== MIME-Version: 1.0 Received: by 10.224.200.130 with SMTP id ew2mr2521456qab.92.1346237656279; Wed, 29 Aug 2012 03:54:16 -0700 (PDT) Received: by 10.49.85.134 with HTTP; Wed, 29 Aug 2012 03:54:16 -0700 (PDT) Date: Wed, 29 Aug 2012 16:24:16 +0530 Message-ID: Subject: Performance comparision external s3 table vs managed table From: Ravi Shetye To: user@hive.apache.org Content-Type: multipart/alternative; boundary=20cf300fac8b4929f604c86561ec --20cf300fac8b4929f604c86561ec Content-Type: text/plain; charset=ISO-8859-1 I am launching HIVE cluster in interactive mode http://aws.amazon.com/elasticmapreduce/faqs/#hive-6. I data on s3 like *s3://ravi/logs/adv_id=123/date=2012-01-01/log.gz* *s3://ravi/logs/adv_id=456/date=2012-01-02/log.gz* *s3://ravi/logs/adv_id=123/date=2012-01-03/log.gz* I create two tables CREATE EXTERNAL TABLE s3Table (...) PARTITIONED BY (adv_id STRING, dt STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION 's3://ravi/logs/'; CREATE TABLE managedTable (...) ==> same defination PARTITIONED BY (adv_id STRING, dt STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'; I load data into both tables ALTER TABLE s3Table RECOVER PARTITIONS; and set hive.exec.dynamic.partition.mode=nonstrict; set hive.exec.dynamic.partition=true INSERT OVERWRITE TABLE managedTable PARTITION (adv_id,dt) SELECT * FROM s3Table; Intuitively I am expecting the managedTable to perform better. I run a count(*) query on both which cont approx 40,000,000 rows The one for s3Table generates mapper per patition and finishes in 149 sec The one for managedTable generates mapper per HDFS Block and finishes in 238sec Can I improve upon the performance of managedTable by any tuning parameters? Should I NOT be using managedTable ever? I did the experiment on m1.large cluster to avoid any IO vs Network reasoning. -- RAVI SHETYE --20cf300fac8b4929f604c86561ec Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable I am launching HIVE cluster =A0in interactive mode

I data on s3 =A0like
s3://ravi/logs/adv_id=3D123/date=3D2012-01-01/log.gz
s3://ravi/logs/adv_id=3D456/date=3D2012-01-02=
/log.gz
s3://ravi/logs/adv_id=3D123/date=3D2012-01-03/log.gz
I create two tables
CREATE EXTERNAL TABLE s3Table (...)
PA= RTITIONED BY (adv_id STRING, dt STRING)
ROW FORMAT DELIMITED FIELDS= TERMINATED BY '\t' LOCATION 's3://ravi/logs/';
CREATE TABLE man= agedTable (...) =A0 =3D=3D> same defination
PA= RTITIONED BY (adv_id STRING, dt STRING)
ROW FORMAT DELIMITED FIELDS= TERMINATED BY '\t';

I load data into both tables
ALTER TABLE s3Table RECOVER PARTITIONS;
and
set hive.exec.dynamic.partition.mode= =3Dnonstrict;
se= t hive.exec.dynamic.partition=3Dtrue
INSERT OVERWRITE TABLE managed= Table PARTITION (adv_id,dt) SELECT * FROM s3Table;

Intuitively=A0I am expectin= g the managedTable to perform better.

I run a count(*) query on both which cont approx 40,= 000,000 rows
The one for s3Table generates mapper per patition and finishes in 149 se= c
The one for managedTable generates mapper per HDFS Block a= nd finishes in 238sec

Can I improve upon the performance of managedTable by any tuning paramete= rs?
Should I NOT be us= ing managedTable ever?

I did the exp= eriment on m1.large cluster to avoid any IO vs Network reasoning.
--=A0
RAVI SHETYE
--20cf300fac8b4929f604c86561ec--