hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jörn Franke <>
Subject Re: optimize joins in hive 1.2.1
Date Mon, 18 Jan 2016 08:37:24 GMT
Do you have some data model?

Basically modern technologies, such as Hive, but also relational database, suggest to prejoin
tables and working on big flat tables. The reason is that they are distributed systems and
you should avoid transferring for each query a lot of data between nodes. 
Hence, Hive works with the ORC format (or Parquet) to support storage indexes and bloom filters.
The idea here is that these techniques allow you to skip reading a lot of data blocks. So
the first optimization is using the right format. The second optimization is to use compression.
For example, snappy allows fast decompression so it is more suitable for current data. The
third optimizations are partitions and buckets. The fourth optimization is evaluating the
different type of joins. The fifths optimizations can be in-memory technologies, such as ignite
HDFS cache or llap. The six optimization is the execution engine. Tez supports currently in
Hive the most optimizations. The seventh optimization is data model: use for join keys int
or similar numeric values. Try to use the right data type for your needs. The eight optimization
can be hardware improvements.  The ninth optimization can be use the right data structure.
If your queries are more graph type queries then consider using a graph database, such as
TitanDB. The tenth type of optimizations are os/network level type of optimizations. For example
using Jumbo frames for your network.
Not all necessary in this order and there is much more to optimize.

But as always this depends on your data structure. Sometimes joins can make sense. Most of
the time you have to experiment.

> On 18 Jan 2016, at 09:07, Divya Gehlot <> wrote:
> Hi,
> Need tips/guidance to optimize(increase perfomance) billion data rows  joins in hive
> Any help would be appreciated.
> Thanks,
> Divya 

View raw message