flink-user mailing list archives

From Stefan Richter <s.rich...@data-artisans.com>
Subject Re: Joining data in Streaming
Date Wed, 31 Jan 2018 09:09:57 GMT
Hi,

if the workarounds that Xingcan and I mentioned are not an option for your use case, then I
think this might currently be the better option. But I would expect better support for
stream joins in the near future.

Best,
Stefan

> On 31.01.2018 at 07:04, Marchant, Hayden <hayden.marchant@citi.com> wrote:
> 
> Stefan,
> 
> So are we essentially saying that in this case, for now, I should stick to the DataSet /
> Batch Table API?
> 
> Thanks,
> Hayden
> 
> -----Original Message-----
> From: Stefan Richter [mailto:s.richter@data-artisans.com] 
> Sent: Tuesday, January 30, 2018 4:18 PM
> To: Marchant, Hayden [ICG-IT] <hm97833@imceu.eu.ssmb.com>
> Cc: user@flink.apache.org; Aljoscha Krettek <aljoscha@apache.org>
> Subject: Re: Joining data in Streaming
> 
> Hi,
> 
> as far as I know, this is not easily possible. What would be required is something like
> a CoFlatMap function where one input stream blocks until the second stream is fully
> consumed, so that the state to join against is built up first. Maybe Aljoscha (in CC) can
> comment on future plans to support this.
> 
> Best,
> Stefan
> 
>> On 30.01.2018 at 12:42, Marchant, Hayden <hayden.marchant@citi.com> wrote:
>> 
>> We have a use case with 2 data sets: one reasonably large data set (a few million
>> entities) and a smaller set of data. We want to do a join between these data sets, and
>> we will be doing this join after both data sets are available. In the world of batch
>> processing, this is pretty straightforward: we'd load both data sets into an application
>> and execute a join operator on them through a common key. Is it possible to do such a
>> join using the DataStream API? I would assume that I'd use the connect operator, though
>> I'm not sure exactly how I should do the join. Do I need the 'smaller' set to be
>> completely loaded into state before I start flowing the large set? My concern is that if
>> I read both data sets from streaming sources, the order in which the data is loaded is
>> not guaranteed, so I may lose lots of potential joined entities because their pairs
>> might not have been read yet.
>> 
>> 
>> Thanks,
>> Hayden Marchant
>> 
>> 
> 
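
For readers of this thread, the idea of holding back the large input until the small one has been fully read can be sketched in plain Java, without any Flink dependency. This is only an illustration under stated assumptions: the class name BufferingJoinSketch, its methods, and in particular the onSmallSideComplete() end-of-input signal are hypothetical and are not part of the DataStream API discussed above. Buffering the large side in memory is also an assumption that would need care with millions of entities.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java simulation of a two-input join that waits for the small side (hypothetical). */
public class BufferingJoinSketch {
    private final Map<String, String> smallSide = new HashMap<>();
    private final List<String[]> pendingLarge = new ArrayList<>();
    private final List<String> output = new ArrayList<>();
    private boolean smallSideComplete = false;

    /** Element from the small input: only builds up the join state. */
    public void onSmall(String key, String value) {
        smallSide.put(key, value);
    }

    /** Element from the large input: join now, or buffer until the small side is done. */
    public void onLarge(String key, String value) {
        if (smallSideComplete) {
            emitIfMatched(key, value);
        } else {
            pendingLarge.add(new String[] {key, value});
        }
    }

    /** Hypothetical end-of-input signal for the small side; joins and clears the buffer. */
    public void onSmallSideComplete() {
        smallSideComplete = true;
        for (String[] kv : pendingLarge) {
            emitIfMatched(kv[0], kv[1]);
        }
        pendingLarge.clear();
    }

    private void emitIfMatched(String key, String largeValue) {
        String smallValue = smallSide.get(key);
        if (smallValue != null) { // inner join: elements without a partner are dropped
            output.add(key + ":" + largeValue + "," + smallValue);
        }
    }

    public List<String> output() {
        return output;
    }

    public static void main(String[] args) {
        BufferingJoinSketch join = new BufferingJoinSketch();
        join.onLarge("k1", "trade-1");   // arrives before its partner, so it is buffered
        join.onSmall("k1", "account-A");
        join.onLarge("k2", "trade-2");   // never gets a partner
        join.onSmallSideComplete();      // buffered elements are joined here
        join.onLarge("k1", "trade-3");   // joined immediately
        System.out.println(join.output());
    }
}
```

Because nothing is emitted until the small side is marked complete, a large-side element that arrives before its partner is not lost, which is the concern raised in the original question. The open part is the completion signal itself, which is what the replies above describe as not easily available on the streaming side.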

