Hi Simone,
could you elaborate a bit on the actual operation you want to perform?
Given a data set {(1, {1,2}), (2, {2,3})}, what is the result of your
operation? Is the result { ({1,2}, {1,2,3}) } because the element 2 is
contained in both sets?
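
To make the question concrete, here is a minimal local sketch of the
interpretation I have in mind. It is only my assumption about the semantics,
not necessarily what you want: sets that share at least one key are merged
transitively, and the IDs of the merged sets are collected alongside the
union of their keys. The names MergeOverlappingSets and mergeOverlapping are
made up for illustration, and the code uses plain Scala collections only, so
it says nothing yet about how to distribute the computation on Flink.

```scala
object MergeOverlappingSets {

  // Hypothetical helper (not part of Flink): merges all (id, keys) pairs whose
  // key sets overlap, directly or transitively, into (ids, unionOfKeys) groups.
  def mergeOverlapping(data: Seq[(Int, Set[Int])]): Set[(Set[Int], Set[Int])] = {
    val groups = data.foldLeft(List.empty[(Set[Int], Set[Int])]) {
      case (acc, (id, keys)) =>
        // Groups built so far are pairwise disjoint, so every group that
        // overlaps the new key set has to be collapsed into a single group.
        val (overlapping, disjoint) =
          acc.partition { case (_, groupKeys) => (groupKeys & keys).nonEmpty }
        val mergedIds = overlapping.map(_._1).foldLeft(Set(id))(_ ++ _)
        val mergedKeys = overlapping.map(_._2).foldLeft(keys)(_ ++ _)
        (mergedIds, mergedKeys) :: disjoint
    }
    groups.toSet
  }
}
```

Under this assumption, MergeOverlappingSets.mergeOverlapping(Seq((1, Set(1, 2)), (2, Set(2, 3))))
yields Set((Set(1, 2), Set(1, 2, 3))), which is the result I asked about above.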
Cheers,
Till
On Wed, May 25, 2016 at 10:22 AM, Simone Robutti <
simone.robutti@radicalbit.io> wrote:
> Hello,
>
> I'm implementing MinHash for recommendation on Flink. I'm almost done, but
> I need an efficient way to merge sets of similar keys together (and later
> join these sets of keys with more data).
>
> The actual data structure is of the form DataSet[(Int,Set[Int])], where the
> left element of the tuple is an ID for the right element, which is a set of
> keys. I want to merge these sets together only if they share at least one
> element.
>
> I'm fairly sure I have studied an efficient solution to this problem for
> a local environment, but I don't really know how to approach it in a
> distributed environment. Any suggestions?
>
> Thanks,
>
> Simone
>
