hive-user mailing list archives

From Gerd König <koenig.boden...@googlemail.com>
Subject Re: config recommendations to boost performance
Date Wed, 25 Feb 2015 15:15:07 GMT
Hi Rajesh,

thanks for your quick response.
After quitting the job, no further containers are being launched.
Unfortunately I have no execution plan (EXPLAIN output) to dive into that
execution in detail.

Do you have recommendations for Tez/Hive parameters that influence the
processing of TBs of data on a small number of worker nodes (== small
no. of mappers in parallel) in general? What should be checked in any case?
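
To put a number on "small no. of mappers in parallel", here is a rough
back-of-the-envelope sketch in Python using the YARN values from my first
mail. It rests on two assumptions that I have not verified on this cluster:
that Tez containers fall back to the map container size when no Tez-specific
container size is set, and that the scheduler rounds every request up to a
multiple of yarn.scheduler.minimum-allocation-mb. The function names are
only illustrative.

```python
import math

# Values taken from the settings quoted in this thread.
NODE_MEMORY_MB = 304640     # yarn.nodemanager.resource.memory-mb
MIN_ALLOCATION_MB = 15360   # yarn.scheduler.minimum-allocation-mb
MAP_MEMORY_MB = 20480       # mapreduce.map.memory.mb
WORKER_NODES = 6


def granted_container_mb(requested_mb, min_allocation_mb):
    """Assumption: requests are rounded up to a multiple of the minimum allocation."""
    return math.ceil(requested_mb / min_allocation_mb) * min_allocation_mb


def containers_per_node(node_mb, container_mb):
    """Number of whole containers that fit into the memory offered by one node."""
    return node_mb // container_mb


granted = granted_container_mb(MAP_MEMORY_MB, MIN_ALLOCATION_MB)
per_node = containers_per_node(NODE_MEMORY_MB, granted)
print("granted MB per container:", granted)
print("containers per node:", per_node)
print("containers in the cluster:", per_node * WORKER_NODES)
```

If those assumptions hold, a 20480 MB request is granted 30720 MB, which
gives 9 containers per node and 54 in the whole cluster, on nodes that have
48 cores each.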

Are my initial thoughts of trying to reduce the no. of files to reduce the
no. of mappers going in the right direction ?
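
For a feeling of how many files are involved, a small Python sketch of the
worst case. It assumes that every 5-minute insert writes one delta with a
file for each of the 12 buckets and that no compaction has run on the
partition yet, so it is an upper bound and not a measurement. The function
name is made up for illustration.

```python
MINUTES_PER_DAY = 24 * 60


def max_delta_files_per_day(insert_interval_min, buckets):
    """Upper bound: one delta file per bucket for every insert, no compaction."""
    inserts_per_day = MINUTES_PER_DAY // insert_interval_min
    return inserts_per_day * buckets


files_per_partition = max_delta_files_per_day(5, 12)
print("max delta files per day partition:", files_per_partition)
print("max delta files for a two-day query:", 2 * files_per_partition)
```

Under these assumptions one day partition can hold up to 3456 delta files,
so the two-day query would have to open up to 6912 of them.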

thanks, G.

Rajesh Balamohan <rajesh.balamohan@gmail.com> wrote on Wed Feb 25 2015 at
11:45:07 AM:

> >>
> A query like "select name,count(id) from table where date='2015-01-01' or
> date='2015-01-02' group by (name)" takes almost forever and needs to be
> cancelled after ~30min.
> >>
>
> It should have ideally scanned only the 2 partitions. Do you see any
> container launches after which you had to kill the job? Or is the split
> computation itself taking more time?
>
> ~Rajesh.B
>
>
> On Wed, Feb 25, 2015 at 1:35 PM, Gerd König <
> koenig.bodensee@googlemail.com> wrote:
>
>> Hi,
>>
>> I'm a bit stuck in optimizing the hive/tez config parameters to speed up
>> Hive/Tez query execution.
>> The cluster consists of 6 worker nodes (with rather hadoop-non-ideal
>> component proportion, but that's given) including: 48Cores/384GB Ram/10HDDs.
>> The Hive table is configured as:
>> - partitioned by day
>> - 12 buckets (bucketed on a smallint column)
>> - transactional=true
>> - snappy compressed ORC format
>> and it contains about 200TB of data.
>> Every 5 minutes newly arrived data is inserted (if any); this, of
>> course, leads to a potentially high number of delta files.
>>
>> A query like "select name,count(id) from table where date='2015-01-01' or
>> date='2015-01-02' group by (name)" takes almost forever and needs to be
>> cancelled after ~30min.
>>
>> Of course, Hive will never be a performance beast, but by executing with
>> Tez I hoped to get much better performance...
>>
>> Some current settings:
>> yarn.nodemanager.resource.memory-mb : 304640
>> yarn.scheduler.minimum-allocation-mb : 15360
>> mapreduce.map.memory.mb : 20480
>> mapreduce.reduce.memory.mb : 25600
>> mapreduce.map.java.opts : -Xmx12288m
>> mapreduce.reduce.java.opts : -Xmx15360m
>> set hive.execution.engine=tez;
>> set tez.queue.name=highresourcequeue;
>> set tez.am.grouping.min-size=268435456;
>> set hive.exec.reducers.max=6;
>> set mapreduce.job.reduces=6;
>>
>>
>> My thoughts are:
>> - improve the data ingestion to reduce the number of delta-files and
>> thereby reduce the number of mappers being required
>> - improve the settings for the automatic compaction to further reduce the
>> number of files and, in turn, the number of mappers
>> - YARN config should be o.k., see properties above
>>
>> What are the main Tez/Hive properties to check/adjust that could improve
>> the performance in the given environment ?!?!
>>
>> Many thanks in advance, G.
>>
>
>
>
> --
> ~Rajesh.B
>
