spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Liquan Pei <liquan...@gmail.com>
Subject Re: Spark SQL question: why build hashtable for both sides in HashOuterJoin?
Date Tue, 30 Sep 2014 04:31:23 GMT
Hi Haopu,

My understanding is that the hashtable on both left and right side is used
for including null values in result in an efficient manner. If hash table
is only built on one side, let's say left side and we perform a left outer
join, for each row in left side, a scan over the right side is needed to
make sure that no matching tuples for that row on left side.

Hope this helps!
Liquan

On Mon, Sep 29, 2014 at 8:36 PM, Haopu Wang <HWang@qilinsoft.com> wrote:

> I take a look at HashOuterJoin and it's building a Hashtable for both
> sides.
>
> This consumes quite a lot of memory when the partition is big. And it
> doesn't reduce the iteration on streamed relation, right?
>
> Thanks!
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
>
>


-- 
Liquan Pei
Department of Physics
University of Massachusetts Amherst

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message