From: Bryan
To: user@spark.apache.org
Subject: Joining large data sets
Date: Mon, 26 Oct 2015 19:13:46 -0400

Hello.

What is the suggested practice for joining two large data streams? I am currently simply mapping out the key tuple on both streams and then executing a join.
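Roughly, what I am doing today looks like the following (a simplified sketch against the DStream API; the event types, field names, and the join key are just placeholders for my actual data):

import org.apache.spark.streaming.dstream.DStream

// Placeholder event types, standing in for whatever the two streams carry.
case class Impression(adId: String, userId: String)
case class Click(adId: String, ts: Long)

def joinByKey(impressions: DStream[Impression],
              clicks: DStream[Click]): DStream[(String, (Impression, Click))] = {
  // Map each stream to (key, value) tuples keyed on the join field.
  val keyedImpressions = impressions.map(i => (i.adId, i))
  val keyedClicks      = clicks.map(c => (c.adId, c))
  // join() shuffles both sides by key within each batch interval.
  keyedImpressions.join(keyedClicks)
}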
I have seen several suggestions for broadcast joins, but those seem to be targeted at joining a large data set to a small one (broadcasting the smaller set).

For joining two large data sets, it would seem better to repartition both sets in the same way and then join each partition; a rough sketch of what I mean follows below. Is there a suggested practice for this problem?
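Something along these lines is what I had in mind (a sketch only; the hash partitioner and the partition count are assumptions for illustration, not a recommendation):

import scala.reflect.ClassTag

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

// Co-partition two large pair RDDs with the same partitioner before joining.
def copartitionedJoin[V: ClassTag, W: ClassTag](
    left: RDD[(String, V)],
    right: RDD[(String, W)],
    numPartitions: Int = 200): RDD[(String, (V, W))] = {
  val partitioner = new HashPartitioner(numPartitions)
  // Each partitionBy is a shuffle, but once both sides share the same
  // partitioner the join can run partition-for-partition without another one.
  val leftPart  = left.partitionBy(partitioner)
  val rightPart = right.partitionBy(partitioner)
  leftPart.join(rightPart)
}

My understanding is that once both sides are co-partitioned the join avoids re-shuffling either RDD, but I am not sure whether this is the recommended pattern for two large streaming sources.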
Thank you,

Bryan Jeffrey