flink-user mailing list archives

From Aljoscha Krettek <aljos...@apache.org>
Subject Re: Akka Quarantine & Old YARN Versions
Date Fri, 04 Aug 2017 09:59:02 GMT
Hi Konstantin,

If you can wait at all, I would suggest skipping the update to 1.3.1 and going directly to
the (not yet released) 1.3.2. Flink 1.3.0 and 1.3.1 had a few critical bugs that are now fixed.
Most notably, there was a problem in the Kafka consumer that could lead to state corruption or
data duplication, and incremental RocksDB checkpoints were not working correctly in some cases.

The vote for 1.3.2 is currently ongoing and the release should happen tomorrow or by Monday
at the latest.

Best,
Aljoscha

> On 4. Aug 2017, at 11:09, Nico Kruber <nico@data-artisans.com> wrote:
> 
> Hi Konstantin,
> I just checked the code and the configuration option is still there and should 
> be working. Somehow, the backport for the 1.2 release branch did contain the 
> documentation while the actual commit on master did not.
> Thanks for the info, let me create a hotfix to fix that.
> 
> 
> Nico
> 
> On Thursday, 3 August 2017 18:05:29 CEST Konstantin Knauf wrote:
>> Hi Nico,
>> 
>> thanks for the quick response! No, this was not enabled :( Since we are
>> in the process of upgrading to 1.3.1: I did not find this option in 1.3,
>> only 1.2. Is this the default behaviour in 1.3 or is this configuration
>> just not documented?
>> 
>> Cheers,
>> 
>> Konstantin
>> 
>> On 03.08.2017 17:11, Nico Kruber wrote:
>>> Hi Konstantin,
>>> I dug through the linked pull requests (of
>>> https://issues.apache.org/jira/browse/FLINK-3347) a bit just to notice
>>> that the fix-version tag was wrong (should have been 1.2.1, not 1.2.0)
>>> but you have that already.
>>> 
>>> In there, it was also mentioned that the quarantine monitor is disabled by
>>> default and can be enabled by setting
>>> `taskmanager.exit-on-fatal-akka-error` to true. If enabled, it should
>>> detect a quarantined task manager and shut it down. In that case, YARN
>>> should notice it and start a new one, if I'm not mistaken.
>>> 
>>> Are you already working with `taskmanager.exit-on-fatal-akka-error`
>>> enabled?
>>> 
>>> 
>>> Nico
>>> 
>>> On Thursday, 3 August 2017 10:53:00 CEST Konstantin Knauf wrote:
>>>> Hi everyone,
>>>> 
>>>> we are running Flink 1.2.1 on YARN 2.4 (I know, way too old :().
>>>> Correlated with the last Flink upgrade from 1.1.3 -> 1.2.1 we are
>>>> experiencing regular TaskManager failures due to
>>>> 
>>>> [Taskmanager Logs]
>>>> 2017-07-10 15:25:26,448 ERROR Remoting - Association to [akka.tcp://flink@<jobmanager>:45303] with UID [-382428140] irrecoverably failed. Quarantining address.
>>>> java.lang.IllegalStateException: Error encountered while processing system message acknowledgement buffer: [1 {0, 1}] ack: ACK[3, {}]
>>>>        at akka.remote.ReliableDeliverySupervisor$$anonfun$receive$1.applyOrElse(Endpoint.scala:289)
>>>>        at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
>>>>        at ...
>>>> 
>>>> As far as I understand https://issues.apache.org/jira/browse/FLINK-3345,
>>>> the TaskManager should be restarted in this case. In our case, YARN does
>>>> not start a new TaskManager container; the container is just missing
>>>> indefinitely. Is it known that this does not work on YARN 2.4?
>>>> 
>>>> If it helps, I can also provide the full job and taskmanager logs...
>>>> 
>>>> Cheers & Thanks,
>>>> 
>>>> Konstantin
> 
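
For reference, the option discussed in this thread could be enabled with a fragment like the one below. This is a minimal sketch, assuming the setting goes into the usual `flink-conf.yaml` and that the cluster is restarted so the TaskManagers pick it up. The option name and value are taken from the thread; everything else is an assumption.

```yaml
# flink-conf.yaml (sketch; only this key is taken from the thread)
# Disabled by default according to the thread. When enabled, a TaskManager
# that detects it has been quarantined should shut itself down, so that
# YARN can notice the lost container and start a replacement.
taskmanager.exit-on-fatal-akka-error: true
```

Whether YARN 2.4 actually starts a replacement container after the TaskManager exits is the open question of the thread and is not settled by this setting alone.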

