impala-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Arya Goudarzi <gouda...@gmail.com>
Subject How to use [SHUFFLE] by default for all JOINS
Date Sat, 24 Feb 2018 02:12:25 GMT
Hi Team,

TL;DR; I am wondering if there is a way to instruct Impala to use shuffle
by default for all join queries as my research didn't end anywhere so far.

We have a multi PiB cluster with hundreds of thousand of partitions. We are
using Impala 1.7 with HDFS. Due to our cluster size, compute_stats, and
compute_incremental_stats are not feasible for us as compute_stats seems a
heavy operation on a lot of our large tables and destabilizes the cluster,
and with compute_incremental_stats we hit IMPALA-2648
<https://issues.apache.org/jira/browse/IMPALA-2648>.

Therefore, to optimize our queries we need to add [shuffle] hint to the
queries with joins, and we have seen that this improves performance 3x on
simple tests because the system doesn't have to stream too much data and
dump it for broadcast join.

We have a large team of analysts who are pushing tons of queries to the
system. It is hard to enforce policy at the moment for them to remember to
use shuffle hint so it doesn't take our system down.

-- 
Cheers,
-Arya

Mime
View raw message