spark-user mailing list archives

From Bryan
Subject Joining large data sets
Date Mon, 26 Oct 2015 23:13:46 GMT

What is the suggested practice for joining two large data streams? Currently I simply map
out the key tuple on both streams and then execute a join.

I have seen several suggestions for broadcast joins, but these seem to be targeted at joining
a larger data set to a small one (by broadcasting the smaller set).

For joining two large datasets, it would seem to be better to repartition both sets in the
same way and then join each partition. Is there a suggested practice for this problem?
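
To make the idea concrete, here is a minimal sketch of what I mean by "repartition both sets in the same way then join each partition". It is plain Python rather than Spark code, and the function names (partition_by_key, join_partition, copartitioned_join) and the sample records are made up for illustration. In Spark I assume the equivalent would be applying the same partitioner to both keyed data sets before calling join.

```python
from collections import defaultdict


def partition_by_key(pairs, num_partitions):
    """Split (key, value) pairs into buckets by hash(key) % num_partitions."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        buckets[hash(key) % num_partitions].append((key, value))
    return buckets


def join_partition(left, right):
    """Inner-join two buckets that hold the same range of keys."""
    index = defaultdict(list)
    for key, value in left:
        index[key].append(value)
    return [(key, (lv, rv)) for key, rv in right for lv in index.get(key, [])]


def copartitioned_join(left_pairs, right_pairs, num_partitions=4):
    """Partition both inputs identically, then join bucket i with bucket i."""
    left_parts = partition_by_key(left_pairs, num_partitions)
    right_parts = partition_by_key(right_pairs, num_partitions)
    joined = []
    for left, right in zip(left_parts, right_parts):
        joined.extend(join_partition(left, right))
    return joined


# Hypothetical records keyed by a (user, day) tuple.
events = [(("u1", "2015-10-26"), "login"), (("u2", "2015-10-26"), "login")]
purchases = [(("u1", "2015-10-26"), "book"), (("u3", "2015-10-26"), "pen")]
matches = copartitioned_join(events, purchases)
```

Because both sides use the same hash function and the same number of partitions, any given key lands in the same bucket index on both sides, so each bucket pair can be joined without looking at any other bucket.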

Thank you,

Bryan Jeffrey