hive-user mailing list archives

From "Bejoy KS" <bejoy...@yahoo.com>
Subject Re: Obvious and not so obvious query optimizations in Hive
Date Wed, 27 Jun 2012 18:46:14 GMT
Hi Richin

I'm not an AWS guy, but let me try answering a few of these in general (not with respect to AWS EMR).

1.       Having my external table data gzipped and reading it in the table v/s no compression
at all.

Bejoy: When the data volume is large, compression saves disk space and I/O, so it is recommended.
Gzip is not splittable, meaning a single gzipped file can't be split across multiple map tasks. Prefer LZO
or Snappy.
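As a sketch of how you'd switch job output from gzip to Snappy in Hive of that era (the property names below are standard Hive/Hadoop settings; whether LZO is available depends on the codec being installed on your cluster):

```sql
-- Compress the output of Hive jobs:
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Gzip works out of the box but produces non-splittable files:
-- SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
-- LZO requires the codec to be installed separately on the cluster:
-- SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;
```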

2.       Having the external table data on S3 v/s having it on HDFS?

3.       Creating intermediate external tables v/s non external tables v/s creating
views?

Bejoy: External tables are not much different from managed tables. When you drop a managed
table using HQL, the underlying data in HDFS is deleted as well; for an external table only
the table definition is dropped and the data in HDFS is preserved. Views are not evaluated
at creation time; when a view is queried, MR jobs are triggered and the required data is extracted
from the source tables. So if you plan to reuse a view n times, it'll be better to
create a table and use that instead; otherwise the view's query will be evaluated n times.
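The three options above look like this in HiveQL (table names and the HDFS path are hypothetical, for illustration only):

```sql
-- Managed table: DROP TABLE deletes both the definition and the HDFS data.
CREATE TABLE logs_managed (ts STRING, msg STRING);

-- External table: DROP TABLE removes only the definition; data stays in HDFS.
CREATE EXTERNAL TABLE logs_external (ts STRING, msg STRING)
LOCATION '/data/logs';

-- View: nothing is materialized; every query against it re-runs the
-- underlying SELECT, so n reuses means n evaluations.
CREATE VIEW recent_errors AS
SELECT ts, msg FROM logs_external WHERE msg LIKE '%ERROR%';
```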

4.       Storing the external table as Textfile v/s Sequence file. I know sequence file
compresses the data, but in what format? I read about RC files and how efficient they are,
how to use them?

Bejoy: Sequence files are splittable on their own, so they are a good choice when used with Snappy.
In a sequence file, compression happens either at record level or block level, which is configurable
at compression time. If you are using Gzip, text files offer a better compression ratio,
since the whole file is compressed in one go, compared with sequence files where it happens per
record/block. But then you compromise on splittability. RC files are a good choice when your
queries touch only a few columns rather than all the columns in a row.
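A sketch of the storage-format choices (table names are hypothetical; SEQUENCEFILE and RCFILE are built-in Hive storage formats, and the compression-type property is a standard Hadoop setting):

```sql
-- Sequence file: splittable container, pairs well with Snappy.
CREATE TABLE events_seq (id INT, payload STRING)
STORED AS SEQUENCEFILE;

-- Choose record- vs block-level compression; BLOCK usually compresses better.
SET mapred.output.compression.type=BLOCK;

-- RCFile stores data column-wise within row groups, so queries that read
-- only a few columns touch less data.
CREATE TABLE events_rc (id INT, payload STRING)
STORED AS RCFILE;
```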

5.       How does the number of reducers get set for a Hive query (the way group by and order
by set the number of reducers to 1)? If I am not changing it explicitly, does it pick it up
from the underlying Hadoop cluster? I am trying to understand the bottleneck between query
and cluster size.

Bejoy: Some queries in Hive, like Order By, are forced to use just a single reducer. For
other queries Hive determines the number of reducers itself. However, you can always specify the
number of reducers on a per-query basis, based on the data the query processes.
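Concretely, the knobs of that Hive era look like this (the values shown are illustrative, not recommendations):

```sql
-- Hive estimates the reducer count from input size using this setting
-- (roughly 1 GB per reducer was the old default):
SET hive.exec.reducers.bytes.per.reducer=1000000000;

-- Or fix the count explicitly for the queries that follow in this session:
SET mapred.reduce.tasks=20;
```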

6.       Any other optimizations/ best practices?

Bejoy: There are lots of other optimizations in Hive that can be applied depending on the query.
There are various join optimizations, group-by optimizations, etc., suited to specific needs.
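Two common examples of the join and group-by optimizations mentioned above (the table names are hypothetical; the hint and settings are standard Hive features of that era):

```sql
-- Map-side join: load the small table into memory and avoid a reduce phase.
SELECT /*+ MAPJOIN(dim) */ f.id, d.name
FROM fact f JOIN dim d ON (f.dim_id = d.id);

-- Partial aggregation on the map side for GROUP BY:
SET hive.map.aggr=true;
-- Handle skewed GROUP BY keys by spreading the work over two MR jobs:
SET hive.groupby.skewindata=true;
```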



Regards
Bejoy KS

Sent from handheld, please excuse typos.

-----Original Message-----
From: <richin.jain@nokia.com>
Date: Wed, 27 Jun 2012 15:47:54 
To: <user@hive.apache.org>
Reply-To: user@hive.apache.org
Subject: Obvious and not so obvious query optimizations in Hive

Hey Hivers,

I am trying to understand some of the obvious and not so obvious optimizations I could
apply to a Hive query on an AWS EMR cluster. I know the answer to some of these questions, but I
want to know what you think, and by what factor one approach affects performance over the
other.


1.       Having my external table data gzipped and reading it in the table v/s no compression
at all.

2.       Having the external table data on S3 v/s having it on HDFS?

3.       Creating intermediate external tables v/s non external tables v/s creating views?

4.       Storing the external table as Textfile v/s Sequence file. I know sequence file compresses
the data, but in what format? I read about RC files and how efficient they are, how to use
them?

5.       How does the number of reducers get set for a Hive query (the way group by and order by
set the number of reducers to 1)? If I am not changing it explicitly, does it pick it up from
the underlying Hadoop cluster? I am trying to understand the bottleneck between query and
cluster size.

6.       Any other optimizations/ best practices?

Thanks a lot in advance.
Richin
