impala-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Quanlong Huang" <>
Subject Re:PHJ node assignment
Date Mon, 12 Feb 2018 11:51:26 GMT
IMU, the left side is always located with the hash join node. If the stats are correct, the
left side will always be a larger table/input. There're two terminologies in the hash join
algorithm: build and probe. The smaller table that can be built into an in-memory hash table
is called the "build" input. It's represented at the right side. After the in-memory hash
table is built, the larger table will be scanned and rows will be probed in the hash table
to find matched results. The larger table is called the "probe" input and represented at the
left side.So not all rows are sent across the network to perform a hash join. Usually the
larger table is scanned locally. Network traffic comes from the "build" input. It's smaller
and sometimes can even be represented as a BloomFilter (one kind of RuntimeFilter in Impala).

However, there's still one case that all rows are sent across the network anyway. That is
when all tables are not located in the Impala cluster (e.g. Impala is deployed in a portion
of the Hadoop cluster). Scanning the tables both consumes network traffic. However, when performing
hash join, the results of the right side will be sent to the left side, since they have smaller
size and consumes less network traffic than sending the left side.

I find this paper in "Impala Reading List" has much more details and deserves to be read more
Hash joins and hash teams in Microsoft SQL Server (Graefe, Bunker, Cooper)


At 2018-02-12 18:13:09, "Jeszy" <> wrote:
>IIUC, every row scanned in a partitioned hash join (both sides) is sent
>across the network (an exchange on HASH(key)). The targets of this exchange
>are nodes that have data locality with the left side of the join. Why does
>Impala do it that way?
>Since all rows are sent across the network anyway, Impala could just use
>all the nodes in the cluster. The upside would be better parallelism for
>the join itself as well as for all the operators sitting on top of it. Is
>there a downside I'm forgetting?
>If not, is there a jira tracking this already? Haven't found one.
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message