flink-dev mailing list archives

From "Matthias J. Sax" <mj...@apache.org>
Subject Re: Sliding Window Join (without duplicates)
Date Tue, 24 Nov 2015 10:33:28 GMT
Stephan is right. A tumbling window does not help. For example, the last
tuple of window n and the first tuple of window n+1 are "close" to each
other and should be joined.

From a SQL-like point of view this is a very common case expressed as:

SELECT * FROM s1,s2 WHERE s1.key = s2.key AND |s1.ts - s2.ts| < window-size

I would not expect to get any duplicates here.

Basically, the window should move by one tuple (for each stream) and
join with all tuples from the other stream that are within the time
range (window size), where the ts of this new tuple defines the boundaries
of the window (i.e., there are no "fixed" window boundaries as defined by
a time-slide).
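
To make the intended semantics concrete, here is a minimal sketch in plain
Python (not Flink API). It assumes tuples arrive in timestamp order and are
given as (stream, key, ts); the function name time_range_join and the
per-key buffers are made up for illustration. Each new tuple probes the
buffer of the other stream, so every matching pair is emitted exactly once,
at the moment its later tuple arrives.

```python
from collections import defaultdict, deque


def time_range_join(events, window_size):
    """Join s1 and s2 tuples with equal keys and |ts1 - ts2| < window_size.

    `events` is an iterable of (stream, key, ts) with stream in {"s1", "s2"},
    assumed to be sorted by ts (no out-of-order arrival is handled).
    Returns a list of (key, s1_ts, s2_ts).
    """
    # One buffer of timestamps per stream and key.
    buffers = {"s1": defaultdict(deque), "s2": defaultdict(deque)}
    results = []
    for stream, key, ts in events:
        other = "s2" if stream == "s1" else "s1"
        candidates = buffers[other][key]
        # Evict tuples too old to match this tuple or any later one.
        while candidates and ts - candidates[0] >= window_size:
            candidates.popleft()
        # Everything left is within the time range of the new tuple.
        for other_ts in candidates:
            pair = (ts, other_ts) if stream == "s1" else (other_ts, ts)
            results.append((key, *pair))
        buffers[stream][key].append(ts)
    return results


events = [
    ("s1", "a", 9),
    ("s2", "a", 11),
    ("s2", "a", 25),
    ("s1", "a", 26),
    ("s2", "b", 27),
]
results = time_range_join(events, window_size=10)
```

The window boundaries here are defined only by the ts of the newly arrived
tuple, so there is no fixed slide and no pair can show up twice.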

Not sure how a "session window" can help here... I guess using the most
generic window API would allow defining a slide of one tuple and a window
size of X seconds. But I don't know how duplicates could be avoided...
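
For comparison, a small plain-Python sketch (again not Flink API) of a
per-window nested-loop join. The function windowed_join, the window
alignment at multiples of the slide, and the sample data are assumptions
made for illustration. With a tumbling window the pair (9, 11) is lost at
the boundary; with an overlapping sliding window it is found, but the pair
(16, 17) falls into two windows and is emitted twice.

```python
def windowed_join(s1, s2, size, slide):
    """Nested-loop join per window; windows are [start, start + size).

    s1 and s2 are lists of (key, ts) with ts >= 0. Window starts are
    assumed to be aligned to multiples of `slide`, and `size` is assumed
    to be a multiple of `slide`.
    """
    out = []
    last_ts = max(ts for _, ts in s1 + s2)
    for start in range(slide - size, last_ts + 1, slide):
        end = start + size
        left = [(k, ts) for k, ts in s1 if start <= ts < end]
        right = [(k, ts) for k, ts in s2 if start <= ts < end]
        # Two nested loops over both window contents.
        for k1, ts1 in left:
            for k2, ts2 in right:
                if k1 == k2:
                    out.append((k1, ts1, ts2))
    return out


s1 = [("a", 9), ("a", 16)]
s2 = [("a", 11), ("a", 17)]

tumbling = windowed_join(s1, s2, size=10, slide=10)
sliding = windowed_join(s1, s2, size=10, slide=5)
```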


On 11/24/2015 11:04 AM, Stephan Ewen wrote:
> I understand Matthias' point. You want to join elements that occur within a
> time range of each other.
> In a tumbling window, you have strict boundaries, and a pair of elements
> that arrives with one element before the boundary and one after
> will not join. Hence the sliding windows.
> What may be a solution here is a "session window" join...
> On Tue, Nov 24, 2015 at 10:33 AM, Aljoscha Krettek <aljoscha@apache.org>
> wrote:
>> Hi,
>> I’m not sure this is a problem. If a user specifies sliding windows then
>> one element can (and will) end up in several windows. If these are joined
>> then there will be multiple results. If the user does not want multiple
>> windows then tumbling windows should be used.
>> IMHO, this is quite straightforward. But let’s see what others have to say.
>> Cheers,
>> Aljoscha
>>> On 23 Nov 2015, at 20:36, Matthias J. Sax <mjsax@apache.org> wrote:
>>> Hi,
>>> it seems that a join on the data streams with an overlapping sliding
>>> window produces duplicates in the output. The default implementation
>>> internally just uses two nested loops over both windows to compute the
>>> result.
>>> How can duplicates be avoided? Is there any way to do so right now? If
>>> not, should we add this?
>>> -Matthias
