beam-user mailing list archives

From Nishu <nishuta...@gmail.com>
Subject Re: Data loss in BeamFlinkRunner
Date Fri, 08 Dec 2017 17:37:40 GMT
Hi Eugene,

Does allowed lateness have any effect in the case of GlobalWindows? My
assumption is that with a global window, the entire data is captured due to
accumulation, and the watermark has no effect.
Please correct me if I'm wrong.

Thanks & regards,
Nishu


On Fri, Dec 8, 2017 at 5:43 PM, Eugene Kirpichov <kirpichov@google.com>
wrote:

> Accumulation mode doesn't affect late-data dropping. I see your pipeline
> specifies an allowed lateness of zero, which means any data arriving behind
> the watermark is dropped: is this intentional?
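>
> For example, something along these lines (an untested sketch that keeps
> your trigger as-is; the one-hour horizon is an arbitrary placeholder, not
> a recommendation) would retain elements up to an hour behind the watermark
> instead of dropping them immediately:
>
> Window.<KV<String, String>>into(new GlobalWindows())
>     .triggering(Repeatedly.forever(
>         AfterProcessingTime.pastFirstElementInPane()
>             .plusDelayOf(Duration.standardSeconds(delayDuration))))
>     .accumulatingFiredPanes()
>     // Non-zero allowed lateness: elements up to this far behind the
>     // watermark are still added to the pane rather than dropped.
>     .withAllowedLateness(Duration.standardHours(1));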
>
> On Fri, Dec 8, 2017, 1:23 AM Nishu <nishutayal@gmail.com> wrote:
>
>> Below is the code, I use for Windows and triggers.
>>
>> Trigger trigger = Repeatedly.forever(
>>     AfterProcessingTime.pastFirstElementInPane()
>>         .plusDelayOf(Duration.standardSeconds(delayDuration)));
>>
>> // Define windows for Customer Topic
>> PCollection<KV<String, String>> customerSubset =
>>     customerInput.apply("CustomerWindowParDo",
>>         Window.<KV<String, String>>into(new GlobalWindows())
>>             .triggering(trigger)
>>             .accumulatingFiredPanes()
>>             .withAllowedLateness(Duration.standardMinutes(0)));
>>
>>
>>
>> Thanks & regards,
>> Nishu
>>
>> On Fri, Dec 8, 2017 at 10:19 AM, Nishu <nishutayal@gmail.com> wrote:
>>
>>> Hi Eugene,
>>>
>>> In my use case, I use GlobalWindows (https://beam.apache.org/documentation/programming-guide/#provided-windowing-functions)
>>> and specify the trigger. With the global window, the entire data is
>>> accumulated every time the trigger fires, so the late-data issue should
>>> be avoided.
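>>>
>>> To illustrate the two pane modes (a rough sketch, not my actual pipeline
>>> code; "input" is a placeholder and "trigger" is the same trigger as in my
>>> earlier snippet):
>>>
>>> // Accumulating: every firing re-emits the full pane contents so far.
>>> input.apply(Window.<KV<String, String>>into(new GlobalWindows())
>>>     .triggering(trigger)
>>>     .accumulatingFiredPanes()
>>>     .withAllowedLateness(Duration.ZERO));
>>>
>>> // Discarding: every firing emits only elements since the last firing.
>>> input.apply(Window.<KV<String, String>>into(new GlobalWindows())
>>>     .triggering(trigger)
>>>     .discardingFiredPanes()
>>>     .withAllowedLateness(Duration.ZERO));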
>>>
>>> I found a JIRA issue (https://issues.apache.org/jira/browse/BEAM-3225)
>>> reporting the same problem in Beam.
>>>
>>> Today I am going to try to write a similar implementation in Flink.
>>>
>>> Thanks,
>>> Nishu
>>>
>>>
>>> On Fri, Dec 8, 2017 at 12:08 AM, Eugene Kirpichov <kirpichov@google.com>
>>> wrote:
>>>
>>>> Most likely this is late data: https://beam.apache.org/documentation/programming-guide/#watermarks-and-late-data .
>>>> Try configuring a trigger with late-data behavior that is more
>>>> appropriate for your particular use case.
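>>>>
>>>> For example (an illustrative sketch only; the window size and lateness
>>>> horizon are placeholders, not recommendations):
>>>>
>>>> Window.<KV<String, String>>into(
>>>>         FixedWindows.of(Duration.standardMinutes(5)))
>>>>     .triggering(AfterWatermark.pastEndOfWindow()
>>>>         // Fire again for each element that arrives late, within the
>>>>         // allowed-lateness horizon below:
>>>>         .withLateFirings(AfterPane.elementCountAtLeast(1)))
>>>>     .withAllowedLateness(Duration.standardHours(1))
>>>>     .accumulatingFiredPanes();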
>>>>
>>>> On Thu, Dec 7, 2017 at 3:03 PM Nishu <nishutayal@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am running a streaming pipeline with the Flink runner. The operator
>>>>> sequence is: read the JSON data, parse the JSON string into an object,
>>>>> and group the objects by a common key. I noticed that the GroupByKey
>>>>> operator throws away some data in between, and hence I don't get all
>>>>> the keys as output.
>>>>>
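>>>>> For reference, the shape of the pipeline is roughly this (a simplified
>>>>> sketch; the broker, topic and parse DoFn are placeholders, and the
>>>>> windowing/triggering discussed above is applied before the GroupByKey):
>>>>>
>>>>> PCollection<KV<String, String>> keyed = pipeline
>>>>>     .apply(KafkaIO.<String, String>read()
>>>>>         .withBootstrapServers("broker:9092")            // placeholder
>>>>>         .withTopic("input-topic")                       // placeholder
>>>>>         .withKeyDeserializer(StringDeserializer.class)
>>>>>         .withValueDeserializer(StringDeserializer.class)
>>>>>         .withoutMetadata())
>>>>>     // Parse the JSON value and key each record by its unique ID:
>>>>>     .apply("ParseJson", ParDo.of(new ParseJsonFn()));   // placeholder DoFn
>>>>>
>>>>> PCollection<KV<String, Iterable<String>>> grouped =
>>>>>     keyed.apply(GroupByKey.<String, String>create());
>>>>>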
>>>>> In the screenshot below, 1001 records are read from the Kafka topic,
>>>>> each with a unique ID. After grouping, only 857 unique IDs are
>>>>> returned. Ideally, the GroupByKey operator should return all 1001.
>>>>>
>>>>>
>>>>> [image: Inline image 3]
>>>>>
>>>>> Any idea, what can be the issue? Thanks in advance!
>>>>>
>>>>> --
>>>>> Thanks & Regards,
>>>>> Nishu Tayal
>>>>>
>>>>
>>>
>>>
>>> --
>>> Thanks & Regards,
>>> Nishu Tayal
>>>
>>
>>
>>
>> --
>> Thanks & Regards,
>> Nishu Tayal
>>
>


-- 
Thanks & Regards,
Nishu Tayal
