Return-Path: X-Original-To: apmail-spark-dev-archive@minotaur.apache.org Delivered-To: apmail-spark-dev-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 32DEF173D8 for ; Tue, 30 Sep 2014 04:31:52 +0000 (UTC) Received: (qmail 26447 invoked by uid 500); 30 Sep 2014 04:31:49 -0000 Delivered-To: apmail-spark-dev-archive@spark.apache.org Received: (qmail 26295 invoked by uid 500); 30 Sep 2014 04:31:49 -0000 Mailing-List: contact dev-help@spark.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Delivered-To: mailing list dev@spark.apache.org Received: (qmail 25347 invoked by uid 99); 30 Sep 2014 04:31:49 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 30 Sep 2014 04:31:49 +0000 X-ASF-Spam-Status: No, hits=1.5 required=5.0 tests=HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of liquanpei@gmail.com designates 74.125.82.175 as permitted sender) Received: from [74.125.82.175] (HELO mail-we0-f175.google.com) (74.125.82.175) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 30 Sep 2014 04:31:23 +0000 Received: by mail-we0-f175.google.com with SMTP id q59so3013581wes.6 for ; Mon, 29 Sep 2014 21:31:23 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; bh=UPZLI1KIzqZLduL8bxy1t5CEa6BeMfupPTTkEiTA67U=; b=lDP7UZbznDiHfSlOIximCtMBLuUwMLsTQy2HxDspy2JswCT9BvmjbmNvdu3BUmrvCw L0H2OU1WzMnVsnmDAcGWaAPskNm0DnZHvqnTgJMs+PuSqk0Cp/vfz17qmHhbsVqrn58U LqmfBlVUVclOApD9CSuhcgi94qHaFXch6+egWZmcuTfzT4EShergzmlhMrjlewf3hgcD QnmzKuZ3Lw8EzweN13PGHpqINIAhX9qIswjSeRuD65ojkMC/B2ednVmS/0HoX4oDjqrZ ahgzQIVUJOEAtio0EuNrjkRKNnjDf5cvNFy4zKWc3xfphXi9bhpgerj5A6uTevhUj3/s KnFA== MIME-Version: 1.0 X-Received: by 10.180.79.34 with SMTP id g2mr2662547wix.10.1412051483125; Mon, 29 Sep 2014 21:31:23 -0700 (PDT) Received: by 10.27.13.17 with HTTP; Mon, 29 Sep 2014 21:31:23 -0700 (PDT) In-Reply-To: <2EB23AF5EEEA2140946B8F292EB2EB9F13AC31@QS-PEK-DC1.qilinsoftcorp.qilinsoft.com> References: <2EB23AF5EEEA2140946B8F292EB2EB9F13AC31@QS-PEK-DC1.qilinsoftcorp.qilinsoft.com> Date: Mon, 29 Sep 2014 21:31:23 -0700 Message-ID: Subject: Re: Spark SQL question: why build hashtable for both sides in HashOuterJoin? From: Liquan Pei To: Haopu Wang Cc: dev@spark.apache.org, user Content-Type: multipart/alternative; boundary=f46d041825d80e6f01050440db66 X-Virus-Checked: Checked by ClamAV on apache.org --f46d041825d80e6f01050440db66 Content-Type: text/plain; charset=UTF-8 Hi Haopu, My understanding is that the hashtable on both left and right side is used for including null values in result in an efficient manner. If hash table is only built on one side, let's say left side and we perform a left outer join, for each row in left side, a scan over the right side is needed to make sure that no matching tuples for that row on left side. Hope this helps! Liquan On Mon, Sep 29, 2014 at 8:36 PM, Haopu Wang wrote: > I take a look at HashOuterJoin and it's building a Hashtable for both > sides. > > This consumes quite a lot of memory when the partition is big. And it > doesn't reduce the iteration on streamed relation, right? > > Thanks! > > --------------------------------------------------------------------- > To unsubscribe, e-mail: user-unsubscribe@spark.apache.org > For additional commands, e-mail: user-help@spark.apache.org > > -- Liquan Pei Department of Physics University of Massachusetts Amherst --f46d041825d80e6f01050440db66--