flink-user mailing list archives

From Elias Levy <fearsome.lucid...@gmail.com>
Subject Re: using updating shared data
Date Wed, 02 Jan 2019 19:44:14 GMT
One thing to be careful of: if you are using event time processing and the
control stream only receives messages sporadically, event time will stop
moving forward in the operator joining the streams while the control stream
is idle, because the operator's watermark is the minimum of the watermarks
of its inputs.  You can get around this by using a periodic watermark
extractor on the control stream that bounds the event time delay to
processing time, or by defining your own low-level operator that ignores
watermarks from the control stream.
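
To make the stall and the two workarounds concrete, here is a minimal sketch
in plain Python. It is a toy model only, not Flink API: TwoInputOperator,
control_watermark, run and max_lag are hypothetical names made up for
illustration, and it assumes that data timestamps roughly track the wall
clock.

```python
from dataclasses import dataclass

MIN_TS = float("-inf")  # "no watermark seen yet"


@dataclass
class TwoInputOperator:
    """Toy model of a two-input operator's event-time clock (not Flink API)."""
    ignore_control: bool = False  # workaround 2: drop control watermarks
    data_wm: float = MIN_TS
    control_wm: float = MIN_TS

    def on_data_watermark(self, ts):
        self.data_wm = max(self.data_wm, ts)

    def on_control_watermark(self, ts):
        self.control_wm = max(self.control_wm, ts)

    @property
    def event_time(self):
        if self.ignore_control:
            return self.data_wm
        # Default behaviour: event time is the minimum over both inputs.
        return min(self.data_wm, self.control_wm)


def control_watermark(last_event_ts, processing_time, max_lag):
    """Workaround 1: a periodic watermark for the control stream that is
    never more than max_lag behind processing time."""
    return max(last_event_ts, processing_time - max_lag)


def run(operator, bound_control_to_clock, max_lag=5):
    """Feed 60 ticks of data while the control stream stays idle."""
    last_control_ts = 0  # the only control message arrived at t=0
    operator.on_control_watermark(last_control_ts)
    for now in range(1, 61):
        operator.on_data_watermark(now)  # data timestamps follow the clock
        if bound_control_to_clock:
            operator.on_control_watermark(
                control_watermark(last_control_ts, now, max_lag))
    return operator.event_time


stalled = run(TwoInputOperator(), bound_control_to_clock=False)
bounded = run(TwoInputOperator(), bound_control_to_clock=True)
ignoring = run(TwoInputOperator(ignore_control=True),
               bound_control_to_clock=False)
print(stalled, bounded, ignoring)
```

In this model the untouched operator stays stuck at the timestamp of the last
control message, the bounded variant trails the data stream by at most
max_lag, and the ignoring variant follows the data stream exactly. Note that
bounding to processing time can mark a late control message as late data if
its timestamp is older than the lag allows.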

On Wed, Jan 2, 2019 at 8:42 AM Avi Levi <avi.levi@bluevoyant.com> wrote:

> Thanks Till, I will definitely check it. Just to make sure that I
> got you correctly: you are suggesting that the list that I want to broadcast
> will be broadcast via the control stream and will then be kept in the
> relevant operator state, correct? And updates (CRUD) on that list will be
> performed via the control stream, correct?
> BR
> Avi
> On Wed, Jan 2, 2019 at 4:28 PM Till Rohrmann <trohrmann@apache.org> wrote:
>> Hi Avi,
>> you could use Flink's broadcast state pattern [1]. You would need to use
>> the DataStream API but it allows you to have two streams (input and control
>> stream) where the control stream is broadcast to all subtasks. So by
>> ingesting messages into the control stream you can send model updates to
>> all subtasks.
>> [1]
>> https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/state/broadcast_state.html
>> Cheers,
>> Till
>> On Tue, Jan 1, 2019 at 6:49 PM miki haiat <miko5054@gmail.com> wrote:
>>> I'm trying to understand your use case.
>>> What is the source of the data? FS, Kafka, something else?
>>> On Tue, Jan 1, 2019 at 6:29 PM Avi Levi <avi.levi@bluevoyant.com> wrote:
>>>> Hi,
>>>> I have a list (a couple of thousand text lines) that I need to use in my
>>>> map function. I read this article about broadcast variables
>>>> <https://ci.apache.org/projects/flink/flink-docs-stable/dev/batch/#broadcast-variables>
>>>> and about using the distributed cache
>>>> <https://ci.apache.org/projects/flink/flink-docs-stable/dev/batch/#distributed-cache>;
>>>> however, I need to update this list from time to time, and if I understood
>>>> correctly that is not possible with a broadcast variable or the cache
>>>> without restarting the job. Is there an idiomatic way to achieve this?
>>>> A DB seems to be overkill for that, and I want to keep I/O and network
>>>> calls to a minimum.
>>>> Cheers
>>>> Avi
