flink-user mailing list archives

From Konstantin Knauf <konstantin.kn...@tngtech.com>
Subject Re: Akka Quarantine & Old YARN Versions
Date Thu, 03 Aug 2017 16:05:29 GMT
Hi Nico,

thanks for the quick response! No, this was not enabled :( We are in the
process of upgrading to 1.3.1, and I did not find this option in 1.3,
only in 1.2. Is this the default behaviour in 1.3, or is this
configuration option just not documented?

Cheers,

Konstantin

On 03.08.2017 17:11, Nico Kruber wrote:
> Hi Konstantin,
> I dug through the pull requests linked from
> https://issues.apache.org/jira/browse/FLINK-3347 a bit, only to notice that
> the fix-version tag was wrong (it should have been 1.2.1, not 1.2.0), but you
> have that already.
> 
> In there, it was also mentioned that the quarantine monitor is disabled by 
> default and can be enabled by setting `taskmanager.exit-on-fatal-akka-error` 
> to true. If enabled, it should detect a quarantined task manager and shut it 
> down. In that case, YARN should notice it and start a new one, if I'm not 
> mistaken.
> 
> Are you already working with `taskmanager.exit-on-fatal-akka-error` enabled?
> 
> 
> Nico
> 
> On Thursday, 3 August 2017 10:53:00 CEST Konstantin Knauf wrote:
>> Hi everyone,
>>
>> we are running Flink 1.2.1 on YARN 2.4 (I know, way too old :().
>> Coinciding with the last Flink upgrade from 1.1.3 -> 1.2.1, we are
>> experiencing regular TaskManager failures due to
>>
>> [Taskmanager Logs]
>> 2017-07-10 15:25:26,448 ERROR Remoting - Association to [akka.tcp://flink@<jobmanager>:45303] with UID [-382428140] irrecoverably failed. Quarantining address.
>> java.lang.IllegalStateException: Error encountered while processing system message acknowledgement buffer: [1 {0, 1}] ack: ACK[3, {}]
>>         at akka.remote.ReliableDeliverySupervisor$$anonfun$receive$1.applyOrElse(Endpoint.scala:289)
>>         at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
>>         at ...
>>
>> As far as I understand https://issues.apache.org/jira/browse/FLINK-3345,
>> the TaskManager should be restarted in this case. In our case, YARN does
>> not start a new TaskManager container; the container just stays missing
>> indefinitely. Is it known that this does not work on YARN 2.4?
>>
>> If it helps, I can also provide the full job and taskmanager logs...
>>
>> Cheers & Thanks,
>>
>> Konstantin
> 
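
For reference, the option Nico mentions would presumably be set like
this. This is only a sketch: the key name is taken from his mail, and it
is an assumption that it is read from the usual flink-conf.yaml.

```yaml
# flink-conf.yaml (assumed location for this setting)
# Key name as quoted in the thread; per Nico it is disabled by default.
taskmanager.exit-on-fatal-akka-error: true
```

With this enabled, a quarantined TaskManager should shut itself down so
that YARN can notice and start a replacement, as described in the quoted
reply. Whether the key still applies in 1.3 is the open question.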

-- 
Konstantin Knauf * konstantin.knauf@tngtech.com * +49-174-3413182
TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Geschäftsführer: Henrik Klagges, Dr. Robert Dahlke, Gerhard Müller
Sitz: Unterföhring * Amtsgericht München * HRB 135082

