impala-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alexander Behm <>
Subject Re: Local join instead of data exchange - co-located blocks
Date Mon, 12 Mar 2018 16:30:29 GMT
Such a specific block arrangement is very uncommon for typical Impala
setups, so we don't attempt to recognize and optimize this narrow case. In
particular, such an arrangement tends to be short lived if you have the
HDFS balancer turned on.

Without making code changes, there is no way today to remove the data
exchanges and make sure that the scheduler assigns scan splits to nodes in
the desired way (co-located, but with possible load imbalance).

In what way is the current setup unacceptable to you? Is this pre-mature
optimization? If you have certain performance expectations/requirements for
specific queries we might be able to help you improve those. If you want to
pursue this route, please help us by posting complete query profiles.


On Mon, Mar 12, 2018 at 6:29 AM, Philipp Krause <> wrote:

> Hello everyone!
> In order to prevent network traffic, I'd like to perform local joins on
> each node instead of exchanging the data and perform a join over the
> complete data afterwards. My query is basically a join over three three
> tables on an ID attribute. The blocks are perfectly distributed, so that
> e.g. Table A - Block 0  and Table B - Block 0  are on the same node. These
> blocks contain all data rows with an ID range [0,1]. Table A - Block 1  and
> Table B - Block 1 with an ID range [2,3] are on another node etc. So I want
> to perform a local join per node because any data exchange would be
> unneccessary (except for the last step when the final node recevieves all
> results of the other nodes). Is this possible?
> At the moment the query plan includes multiple data exchanges, although
> the blocks are already perfectly distributed (manually).
> I would be grateful for any help!
> Best regards
> Philipp Krause

View raw message