impala-user mailing list archives

From Alexander Behm <alex.b...@cloudera.com>
Subject Re: Local join instead of data exchange - co-located blocks
Date Mon, 12 Mar 2018 17:38:40 GMT
I suppose one exception is if your data lives only on a single node. Then
you can set num_nodes=1 and make sure to send the query request to the
impalad running on the same data node as the target data. Then you should
get a local join.
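
As a rough sketch of the single-node case above, the snippet below builds an
impala-shell invocation that sets NUM_NODES=1 and targets one specific impalad.
The host name, table names, and the helper function are hypothetical; it also
assumes impala-shell's -i (impalad address) and -q (query) options and the
default shell port 21000. It only constructs the command and does not run it.

```python
import shlex


def build_impala_shell_command(impalad_host, query, port=21000):
    """Return an argv list that runs `query` with NUM_NODES=1 on one impalad.

    Hypothetical helper for illustration; it does not contact any cluster.
    """
    # Prepend the query option so the whole query runs on the coordinator only.
    statements = "SET NUM_NODES=1; " + query.rstrip().rstrip(";") + ";"
    # -i selects the impalad to connect to; it should be the daemon on the
    # data node that actually holds the blocks of the target tables.
    return ["impala-shell", "-i", f"{impalad_host}:{port}", "-q", statements]


# Assumed host and tables, purely for illustration.
cmd = build_impala_shell_command(
    "datanode1.example.com",
    "SELECT a.id FROM table_a a JOIN table_b b ON a.id = b.id",
)
print(shlex.join(cmd))
```

Whether the scans are then really served from local blocks still has to be
checked in the query profile.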

On Mon, Mar 12, 2018 at 9:30 AM, Alexander Behm <alex.behm@cloudera.com>
wrote:

> Such a specific block arrangement is very uncommon for typical Impala
> setups, so we don't attempt to recognize and optimize this narrow case. In
> particular, such an arrangement tends to be short lived if you have the
> HDFS balancer turned on.
>
> Without making code changes, there is no way today to remove the data
> exchanges and make sure that the scheduler assigns scan splits to nodes in
> the desired way (co-located, but with possible load imbalance).
>
> In what way is the current setup unacceptable to you? Is this premature
> optimization? If you have certain performance expectations/requirements for
> specific queries we might be able to help you improve those. If you want to
> pursue this route, please help us by posting complete query profiles.
>
> Alex
>
> On Mon, Mar 12, 2018 at 6:29 AM, Philipp Krause
> <philippkrause.mail@googlemail.com> wrote:
>
>> Hello everyone!
>>
>> In order to prevent network traffic, I'd like to perform local joins on
>> each node instead of exchanging the data and performing a join over the
>> complete data afterwards. My query is basically a join over three
>> tables on an ID attribute. The blocks are perfectly distributed, so that
>> e.g. Table A - Block 0 and Table B - Block 0 are on the same node. These
>> blocks contain all data rows with an ID range [0,1]. Table A - Block 1 and
>> Table B - Block 1 with an ID range [2,3] are on another node etc. So I want
>> to perform a local join per node because any data exchange would be
>> unnecessary (except for the last step when the final node receives all
>> results of the other nodes). Is this possible?
>> At the moment the query plan includes multiple data exchanges, although
>> the blocks are already perfectly distributed (manually).
>> I would be grateful for any help!
>>
>> Best regards
>> Philipp Krause
>>
>>
>
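
The block layout described in the original question can be illustrated with a
small standalone simulation. This is a toy model in plain Python, not Impala
code and not how Impala's scheduler or exchange operators work; the node names
and row values are made up, and only two tables are used for brevity. It shows
that when matching ID ranges of both tables sit on the same node, the union of
per-node joins equals the join over the complete data.

```python
from collections import defaultdict


def hash_join(left, right):
    """Inner-join two lists of (id, value) rows on id."""
    index = defaultdict(list)
    for row_id, value in right:
        index[row_id].append(value)
    return [
        (row_id, left_value, right_value)
        for row_id, left_value in left
        for right_value in index.get(row_id, ())
    ]


# Hypothetical layout: each node holds the blocks of A and B for one ID range.
node_blocks = {
    "node0": {"A": [(0, "a0"), (1, "a1")], "B": [(0, "b0"), (1, "b1")]},
    "node1": {"A": [(2, "a2"), (3, "a3")], "B": [(2, "b2"), (3, "b3")]},
}

# Local joins: every node joins only its own blocks; results are concatenated.
local_result = []
for blocks in node_blocks.values():
    local_result.extend(hash_join(blocks["A"], blocks["B"]))

# Global join: gather all rows first, as a plan with data exchanges would.
all_a = [row for blocks in node_blocks.values() for row in blocks["A"]]
all_b = [row for blocks in node_blocks.values() for row in blocks["B"]]
global_result = hash_join(all_a, all_b)

same = sorted(local_result) == sorted(global_result)
print("local joins match global join:", same)
```

The equivalence depends on the ID ranges being disjoint across nodes and
aligned between the tables; once a block moves (for example through the HDFS
balancer), the per-node joins would miss rows.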
