flink-dev mailing list archives

From Fabian Hueske <fhue...@gmail.com>
Subject Re: [QUESTION] the differences between DataStream.join() and DataStream.coGroup()
Date Thu, 19 May 2016 07:20:41 GMT
Hi,

you are right, at the moment join() looks like syntactic sugar around
coGroup(). Internally, it wraps a FlatJoinFunction in a
CoGroupFunction and calls DataStream.coGroup().
This works because coGroup is more generic and can be used to execute
a join. However, because join is more specialized, there may also be more
efficient strategies to execute it.
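
To illustrate the wrapping described above, here is a minimal sketch in plain Java. The interfaces SimpleFlatJoinFunction and SimpleCoGroupFunction are hypothetical, simplified stand-ins (they are not Flink's actual types, and a Consumer replaces Flink's Collector); the point is only that a pairwise join function can be run by a coGroup function that iterates over both groups.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

final class JoinViaCoGroupSketch {

    // Hypothetical stand-in: called once per pair of elements with the same key.
    interface SimpleFlatJoinFunction<A, B, R> {
        void join(A left, B right, Consumer<R> out);
    }

    // Hypothetical stand-in: called once per key with both full groups.
    interface SimpleCoGroupFunction<A, B, R> {
        void coGroup(Iterable<A> left, Iterable<B> right, Consumer<R> out);
    }

    // Wraps a join function in a coGroup function by pairing every
    // left element with every right element of the same group.
    static <A, B, R> SimpleCoGroupFunction<A, B, R> wrap(
            SimpleFlatJoinFunction<A, B, R> joinFn) {
        return (left, right, out) -> {
            for (A a : left) {
                for (B b : right) {
                    joinFn.join(a, b, out);
                }
            }
        };
    }

    // Runs the wrapped function on one group (all elements share a key).
    static List<String> demo() {
        SimpleFlatJoinFunction<String, Integer, String> joinFn =
                (name, amount, out) -> out.accept(name + ":" + amount);
        List<String> results = new ArrayList<>();
        wrap(joinFn).coGroup(Arrays.asList("a", "b"), Arrays.asList(1, 2), results::add);
        return results;
    }
}
```

If either group is empty, the nested loop produces nothing, which matches inner-join behaviour.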

Providing an API for join has several benefits:
- The implementation can be improved without affecting the user.
- The DataStream API is more similar to the DataSet API, which might help
users who work with both APIs.
- Join and CoGroup are similar, but still different operations. CoGroup
looks at the full group of elements with the same key. Join looks only at
pairs of elements with identical keys. Due to SQL, the concept of a join is
probably better known than coGroup.
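
The difference between "full group" and "pairs" in the last point can be sketched in plain Java as follows. This is only an illustration under simplifying assumptions: records are hypothetical "key=value" strings, there are no windows, and none of the names below come from Flink.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

final class GroupVsPairsSketch {

    // Groups "key=value" records by key (sorted for stable output).
    static Map<String, List<String>> byKey(List<String> records) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String record : records) {
            String[] kv = record.split("=", 2);
            groups.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        return groups;
    }

    // coGroup view: one result per key, built from the full groups of both
    // sides; a key that exists on only one side still shows up.
    static List<String> coGroupView(List<String> left, List<String> right) {
        Map<String, List<String>> l = byKey(left);
        Map<String, List<String>> r = byKey(right);
        TreeSet<String> keys = new TreeSet<>(l.keySet());
        keys.addAll(r.keySet());
        List<String> out = new ArrayList<>();
        for (String k : keys) {
            out.add(k + " -> " + l.getOrDefault(k, Collections.emptyList())
                    + " / " + r.getOrDefault(k, Collections.emptyList()));
        }
        return out;
    }

    // join view: one result per pair of elements with identical keys;
    // keys without a partner on the other side produce nothing.
    static List<String> joinView(List<String> left, List<String> right) {
        Map<String, List<String>> l = byKey(left);
        Map<String, List<String>> r = byKey(right);
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : l.entrySet()) {
            for (String a : e.getValue()) {
                for (String b : r.getOrDefault(e.getKey(), Collections.emptyList())) {
                    out.add(e.getKey() + " -> (" + a + ", " + b + ")");
                }
            }
        }
        return out;
    }
}
```

Because the coGroup view also sees groups that are empty on one side, it can express things such as outer joins, which a purely pairwise join function cannot.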

Best, Fabian

2016-05-19 9:05 GMT+02:00 Jark Wu <wuchong.wc@alibaba-inc.com>:

> I have read the source code and found that the JoinedStreams
> implementation is almost the same as that of CoGroupedStreams (internally,
> JoinedStreams is based on CoGroupedStreams). So why do we
> provide two different interfaces, `DataStream.join()` and
> `DataStream.coGroup()`, which are exactly the same? The documentation [1]
> does not indicate that they do the same thing. Or are there any
> differences between `DataStream.join()` and `DataStream.coGroup()` that I
> missed?
> -- Jark Wu
>
