What is the suggested practice for joining two large data streams? Currently I simply map each stream to its key tuple and then execute a join.
I have seen several suggestions for broadcast joins, but they seem to be targeted at joining a larger data set to a small one (by broadcasting the smaller set).
For joining two large datasets, it would seem better to repartition both sets in the same way and then join each pair of matching partitions. Is there a suggested practice for this problem?
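
To make the repartition-then-join idea concrete, here is a minimal, framework-agnostic sketch in plain Python. It is only an illustration of what I mean, not any particular engine's API: the function names (`partition_by_key`, `join_partition`, `partitioned_join`) and the `num_partitions` parameter are hypothetical, and a real engine would spread the partitions across workers instead of looping over them locally.

```python
from collections import defaultdict


def partition_by_key(records, num_partitions):
    """Split (key, value) records into buckets using hash(key).

    Applying the same function to both inputs guarantees that equal keys
    land in the bucket with the same index on both sides.
    """
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions


def join_partition(left_part, right_part):
    """Inner hash join of one pair of co-located partitions."""
    index = defaultdict(list)
    for key, value in left_part:
        index[key].append(value)
    return [
        (key, (left_value, right_value))
        for key, right_value in right_part
        for left_value in index.get(key, [])
    ]


def partitioned_join(left, right, num_partitions=4):
    """Repartition both inputs identically, then join partition by partition."""
    left_parts = partition_by_key(left, num_partitions)
    right_parts = partition_by_key(right, num_partitions)
    joined = []
    for left_part, right_part in zip(left_parts, right_parts):
        # Each pair can be joined independently of every other pair.
        joined.extend(join_partition(left_part, right_part))
    return joined


if __name__ == "__main__":
    left = [((1, "a"), "L1"), ((2, "b"), "L2"), ((1, "a"), "L3")]
    right = [((1, "a"), "R1"), ((3, "c"), "R2")]
    print(sorted(partitioned_join(left, right)))
```

The point of the sketch is that once both sides share the same partitioning, each partition pair can be joined without looking at any other partition, so neither data set needs to be broadcast.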