hive-user mailing list archives

From Gerd König
Subject config recommendations to boost performance
Date Wed, 25 Feb 2015 08:05:36 GMT

I'm a bit stuck optimizing the Hive/Tez config parameters to speed up
Hive/Tez query execution.
The cluster consists of 6 worker nodes (the hardware proportions are not
ideal for Hadoop, but that's a given) with: 48 cores / 384 GB RAM / 10 HDDs.
The Hive table is configured as:
- partitioned by day
- 12 buckets (bucketed on a smallint column)
- transactional=true
- snappy compressed ORC format
and it contains about 200TB of data.
Every 5 minutes, newly arrived data (if any) is inserted; this, of
course, can lead to a high number of delta files.

A query like "select name,count(id) from table where date='2015-01-01' or
date='2015-01-02' group by (name)" takes almost forever and needs to be
cancelled after ~30min.
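
One way to see how such a query is split into Tez vertices (and where the time may go) before running it is to prefix it with EXPLAIN. The following is only a sketch: the table name `events` is hypothetical, since the real name is not given, and `date` is quoted with backticks on the assumption that it may clash with the type keyword in some Hive versions.

```sql
-- Hypothetical table name `events`; columns name, id and date come from the query above.
EXPLAIN
SELECT name, COUNT(id)
FROM events
WHERE `date` IN ('2015-01-01', '2015-01-02')  -- same filter as the two OR'ed equality checks
GROUP BY name;
```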

Of course, Hive will never be a performance beast, but by executing with
Tez I hoped to get much better performance...

Some current settings:
yarn.nodemanager.resource.memory-mb : 304640
yarn.scheduler.minimum-allocation-mb : 15360 : 20480
mapreduce.reduce.memory.mb : 25600 : -Xmx12288m : -Xmx15360m
set hive.execution.engine=tez;
set 268435456;
set hive.exec.reducers.max=6;
set mapreduce.job.reduces=6;

My thoughts are:
- improve the data ingestion to reduce the number of delta-files and
thereby reduce the number of mappers being required
- improve the settings for automatic compaction to further reduce the
number of files and, in turn, the number of mappers
- YARN config should be o.k., see properties above
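
The compaction idea in the list above can be sketched with Hive's manual compaction statements. This is a minimal example under stated assumptions: the table name `events` is hypothetical, and the partition column `date` is taken from the query earlier in this message.

```sql
-- Hypothetical table name `events`; partition column `date` as in the query above.
-- Queue a major compaction for one day partition (rewrites base and delta files into a new base).
ALTER TABLE events PARTITION (`date` = '2015-01-01') COMPACT 'major';

-- List compaction requests and their current state.
SHOW COMPACTIONS;
```

For automatic compaction, the metastore-side properties hive.compactor.delta.num.threshold and hive.compactor.delta.pct.threshold control when minor and major compactions are triggered; which values suit a 5-minute ingest interval is not something this sketch can settle.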

What are the main Tez/Hive properties to check/adjust that could improve
the performance in the given environment?

Many thanks in advance, G.
