From: Bryan
To: user@spark.apache.org
Subject: Joining large data sets
Date: Mon, 26 Oct 2015 19:13:46 -0400

Hello.

What is the suggested practice for joining two large data streams? I am currently simply mapping out the key tuple on both streams and then executing a join.
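Roughly, what I am doing today looks like the following (a simplified sketch against the DStream API; the event types, field names, and the join key are just placeholders for my actual data):

import org.apache.spark.streaming.dstream.DStream

// Placeholder event types, standing in for whatever the two streams carry.
case class Impression(adId: String, userId: String)
case class Click(adId: String, ts: Long)

def joinByKey(impressions: DStream[Impression],
              clicks: DStream[Click]): DStream[(String, (Impression, Click))] = {
  // Map each stream to (key, value) tuples keyed on the join field.
  val keyedImpressions = impressions.map(i => (i.adId, i))
  val keyedClicks      = clicks.map(c => (c.adId, c))
  // join() shuffles both sides by key within each batch interval.
  keyedImpressions.join(keyedClicks)
}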
I have seen several suggestions for broadcast joins, but those seem to be targeted at joining a large data set to a small one (broadcasting the smaller set).

For joining two large data sets, it would seem better to repartition both sets in the same way and then join each partition; a rough sketch of what I mean follows below. Is there a suggested practice for this problem?
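Something along these lines is what I had in mind (a sketch only; the hash partitioner and the partition count are assumptions for illustration, not a recommendation):

import scala.reflect.ClassTag

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

// Co-partition two large pair RDDs with the same partitioner before joining.
def copartitionedJoin[V: ClassTag, W: ClassTag](
    left: RDD[(String, V)],
    right: RDD[(String, W)],
    numPartitions: Int = 200): RDD[(String, (V, W))] = {
  val partitioner = new HashPartitioner(numPartitions)
  // Each partitionBy is a shuffle, but once both sides share the same
  // partitioner the join can run partition-for-partition without another one.
  val leftPart  = left.partitionBy(partitioner)
  val rightPart = right.partitionBy(partitioner)
  leftPart.join(rightPart)
}

My understanding is that once both sides are co-partitioned the join avoids re-shuffling either RDD, but I am not sure whether this is the recommended pattern for two large streaming sources.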
Thank you,

Bryan Jeffrey