flink-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Kirschnick, Johannes" <johannes.kirschn...@tu-berlin.de>
Subject Re: Join with a custom predicate
Date Mon, 27 Apr 2015 10:16:38 GMT
Hi,

thanks for dedicating some of your weekend time to this problem.
Your solution looks quite neat, thanks ..  

For your information, what I'm trying to do is multiply matrix blocks (dataset one) with individual
elements of a vector (dataset two, of row id and some value).
In this case "one" needs to see all the qualifying entries from "two" to correctly multiply,
ideally as a group for streamlined multiplication.

Johannes

-----Urspr√ľngliche Nachricht-----
Von: Alexander Alexandrov [mailto:alexander.s.alexandrov@gmail.com] 
Gesendet: Sonntag, 26. April 2015 23:22
An: dev@flink.apache.org
Betreff: Re: Join with a custom predicate

I thought about your problem over the weekend. Unfortunately the algorithm that you describe
does not fit "regular" equi-join semantics, but I think it could be "fitted" with a more complex
dataflow.

To achieve that, I would partition the (active) domain of the two datasets on fine-granular
intervals (for the sake of the example, let's say 10.

You can prepare a "coarse-grained" join key on the inputs using a "x % 10"
(Flat)Map:

One: (0, {3,6}), (0, {5,7})
Two: (0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (0, 6), (0, 7)

Upon that you can do a regular join on the "coarse-grained" key (in the first component of
the tuples), and follow that with a filter that evaluates the actual "one.start <= two.number
<= one.end" predicate.

Regards,
Alex


2015-04-24 20:55 GMT+02:00 Kirschnick, Johannes <
johannes.kirschnick@tu-berlin.de>:

> Hi
> I have a small problem with doing a custom join, that I would need 
> some help with. Maybe I'm also approaching the problem wrong.
> So basically I have two dataset.
> The simplified example: The first one has a start and end value. The 
> second dataset is just a list of ordered numbers and some value (value 
> is ignored in the example) Example One = {3,6},{5,7} Two = 
> 1,2,3,4,5,6,7 What I need is a sort of custom join, that brings to the 
> first dataset all elements from the second that are within the range.
> Something like .. join where one.start <= two.number <= one.end So 
> {3,6} from one would only need to "see" 3,4,5 Joining does not work 
> out of the box here as the key is sort of "dynamic"
> depending on the value of one.
> I can just use a map for the first dataset and broadcast the second 
> into the mapper which can then select the required elements - but my 
> assumption is that the second dataset might actually be very large as 
> well, but the qualifying join "numbers" from two will actually be small.
> Is there something I could do in this particular case?
> Thanks a lot
> Johannes
>
Mime
View raw message