flink-user mailing list archives

From Lasse Nedergaard <lassenedergaardfl...@gmail.com>
Subject Re: Latency tracking together with broadcast state can cause job failure
Date Wed, 22 Apr 2020 10:54:31 GMT
Hi Yun

Thanks for looking into it and forwarding it to the right place.


Best regards
Lasse Nedergaard


> On 22 Apr 2020, at 11:06, Yun Tang <myasuka@live.com> wrote:
> 
> 
> Hi Lasse
> 
> After debugging locally, I believe this is a bug in Flink (even in the latest
> version). However, the bug appears to be in the network stack, with which I am
> not very familiar, so the root cause is not easy to find directly. After
> discussing it with the Flink network developers, we decided to first create
> FLINK-17322 [1] to track this problem, and the relevant owner will take a look
> at it.
> 
> Thank you very much for reporting this bug.
> 
> [1] https://issues.apache.org/jira/browse/FLINK-17322
> 
> Best
> Yun Tang
> From: Yun Tang <myasuka@live.com>
> Sent: Wednesday, April 22, 2020 1:43
> To: Lasse Nedergaard <lassenedergaardflink@gmail.com>
> Cc: user <user@flink.apache.org>
> Subject: Re: Latency tracking together with broadcast state can cause job failure
>  
> Hi Lasse
> 
> I'm really sorry for missing your reply. I'll run your project and look for the
> root cause during my daytime. Thanks also to @Robert Metzger for the kind reminder.
> 
> Best
> Yun Tang
> From: Robert Metzger <rmetzger@apache.org>
> Sent: Tuesday, April 21, 2020 20:01
> To: Lasse Nedergaard <lassenedergaardflink@gmail.com>
> Cc: Yun Tang <myasuka@live.com>; user <user@flink.apache.org>
> Subject: Re: Latency tracking together with broadcast state can cause job failure
>  
> Hey Lasse,
> has the problem been resolved?
> 
> (I'm also responding to this to make sure the thread gets attention again :) )
> 
> Best,
> Robert
> 
> 
>> On Wed, Apr 1, 2020 at 10:03 PM Lasse Nedergaard <lassenedergaardflink@gmail.com> wrote:
>> Hi
>> 
>> I have attached a simple project with a test that reproduces the problem. The
>> usual failure is a corrupted (mixed-up) string, but you can also get an EOF
>> exception.
>> Please let me know if you have any questions about the project.
>> 
>> Best regards
>> Lasse Nedergaard
>> 
>> 
>> On 1 Apr 2020, at 09:15, Yun Tang <myasuka@live.com> wrote:
>> 
>> 
>> Hi Lasse
>> 
>> I have never encountered this problem before. Could you share an exception
>> stack trace so that we can take a look? A simple project that reproduces the
>> problem would also be helpful.
>> 
>> Best
>> Yun Tang
>> From: Lasse Nedergaard <lassenedergaardflink@gmail.com>
>> Sent: Tuesday, March 31, 2020 19:10
>> To: user <user@flink.apache.org>
>> Subject: Latency tracking together with broadcast state can cause job failure
>>  
>> Hi
>> 
>> In both Flink 1.9.2 and 1.10 we have struggled with random deserialization
>> and index-out-of-range exceptions in one of our jobs. We also get
>> out-of-memory exceptions.
>> We have now identified latency tracking combined with broadcast state as the
>> cause of the problem. When we run integration tests locally we don't see any
>> problem; it only fails when running on the cluster.
>> We have concluded that the latency tracking packets sent over broadcast
>> corrupt the data stream, which causes the exceptions.
>> We are preparing a simple project on GitHub that reproduces the problem so
>> that the underlying issue can be solved.
>> 
>> Has anyone else seen this kind of problem?
>> 
>> Best regards
>> Lasse Nedergaard
>> 
