flink-issues mailing list archives

From "Xintong Song (Jira)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-23573) Tests fail with AdaptiveScheduler due to exceptions in logs trying to offer slots after JobMaster shutdown
Date Mon, 02 Aug 2021 11:28:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-23573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391523#comment-17391523 ]

Xintong Song commented on FLINK-23573:
--------------------------------------

Thanks [~chesnay]. I'll add it to the whitelist.

> Tests fail with AdaptiveScheduler due to exceptions in logs trying to offer slots after JobMaster shutdown
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-23573
>                 URL: https://issues.apache.org/jira/browse/FLINK-23573
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>            Reporter: Xintong Song
>            Assignee: Xintong Song
>            Priority: Major
>              Labels: test-stability
>             Fix For: 1.14.0
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=20267&view=logs&j=4dd4dbdd-1802-5eb7-a518-6acd9d24d0fc&t=8d6b4dd3-4ca1-5611-1743-57a7d76b395a&l=412
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=20396&view=logs&j=4dd4dbdd-1802-5eb7-a518-6acd9d24d0fc&t=8d6b4dd3-4ca1-5611-1743-57a7d76b395a&l=2390
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=20454&view=logs&j=4dd4dbdd-1802-5eb7-a518-6acd9d24d0fc&t=8d6b4dd3-4ca1-5611-1743-57a7d76b395a&l=371
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=21228&view=logs&j=4dd4dbdd-1802-5eb7-a518-6acd9d24d0fc&t=7c4a8fb8-eeee-5a77-f518-4176bfae300b&l=2437
> The test failed due to exceptions in the logs. I ran the following command, taken from flink-end-to-end-tests/test-scripts/common.sh, on the logs, and it points to the RecipientUnreachableException in the TM logs. The problem is that the TM received extra slot requests from the RM after the tasks had finished and the slots had been freed, while the JobMaster it tried to offer slots to had already shut down.
> {code}
> $ grep -rv "GroupCoordinatorNotAvailableException" . \
>    | grep -v "RetriableCommitFailedException" \
>    | grep -v "NoAvailableBrokersException" \
>    | grep -v "Async Kafka commit failed" \
>    | grep -v "DisconnectException" \
>    | grep -v "Cannot connect to ResourceManager right now" \
>    | grep -v "AskTimeoutException" \
>    | grep -v "WARN  akka.remote.transport.netty.NettyTransport" \
>    | grep -v "WARN  org.apache.flink.shaded.akka.org.jboss.netty.channel.DefaultChannelPipeline" \
>    | grep -v 'INFO.*AWSErrorCode' \
>    | grep -v "RejectedExecutionException" \
>    | grep -v "CancellationException" \
>    | grep -v "An exception was thrown by an exception handler" \
>    | grep -v "Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.yarn.exceptions.YarnException" \
>    | grep -v "Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.conf.Configuration" \
>    | grep -v "java.lang.NoClassDefFoundError: org/apache/hadoop/yarn/exceptions/YarnException" \
>    | grep -v "java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration" \
>    | grep -v "java.lang.Exception: Execution was suspended" \
>    | grep -v "java.io.InvalidClassException: org.apache.flink.formats.avro.typeutils.AvroSerializer" \
>    | grep -v "Caused by: java.lang.Exception: JobManager is shutting down" \
>    | grep -v "java.lang.Exception: Artificial failure" \
>    | grep -v "org.apache.flink.runtime.checkpoint.CheckpointException" \
>    | grep -v "org.elasticsearch.ElasticsearchException" \
>    | grep -v "Elasticsearch exception" \
>    | grep -v "org.apache.flink.runtime.JobException: Recovery is suppressed" \
>    | grep -v "WARN  akka.remote.ReliableDeliverySupervisor" \
>    | grep -i "exception"
> ./flink-vsts-taskexecutor-0-fv-az217-107.log:org.apache.flink.runtime.rpc.exceptions.RecipientUnreachableException: Could not send message [RemoteFencedMessage(00000000000000000000000000000000, RemoteRpcInvocation(null.offerSlots(ResourceID, Collection, Time)))] from sender [Actor[akka.tcp://flink@10.1.0.175:38955/temp/$0b]] to recipient [Actor[akka://flink/user/rpc/jobmanager_2#1483449133]], because the recipient is unreachable. This can either mean that the recipient has been terminated or that the remote RpcService is currently not reachable.
> {code}
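
The check quoted above can be read as "print every log line that mentions an exception unless it matches an allowed pattern". A minimal sketch of that idea follows; it is not the code in common.sh. The function name find_unexpected_exceptions and its argument layout are hypothetical, and the pattern the fix actually adds to the whitelist is not stated in this message.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of Flink's common.sh): print every log line
# under a directory that mentions "exception" (case-insensitive) and matches
# none of the allowed patterns.
# Usage: find_unexpected_exceptions LOG_DIR [ALLOWED_PATTERN...]
find_unexpected_exceptions() {
  local log_dir="$1"
  shift

  # Turn each allowed pattern into a "-e PATTERN" pair for a single grep -v.
  local allowed=()
  local pattern
  for pattern in "$@"; do
    allowed+=(-e "$pattern")
  done

  if [ "${#allowed[@]}" -eq 0 ]; then
    # No whitelist given: report every line that mentions an exception.
    grep -r -i "exception" "$log_dir"
  else
    # Keep lines that mention an exception, then drop the allowed ones.
    grep -r -i "exception" "$log_dir" | grep -v "${allowed[@]}"
  fi
}
```

For example, `find_unexpected_exceptions . "AskTimeoutException" "DisconnectException"` would still print the RecipientUnreachableException line shown above, and passing a pattern that matches that line would suppress it. As with the quoted pipeline, the patterns are applied to the file-name-prefixed output of `grep -r`, and the function returns a non-zero status when nothing unexpected is found.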



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
