flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Anyang Hu <huanyang1...@gmail.com>
Subject Re: suggestion of FLINK-10868
Date Sun, 08 Sep 2019 09:25:49 GMT
Hi Peter,

For our online batch task, there is a scene where the failed Container
reaches MAXIMUM_WORKERS_FAILURE_RATE but the client will not immediately
exit (the probability of JM loss is greatly improved when thousands of
Containers is to be started). It is found that the JM disconnection (the
reason for JM loss is unknown) will cause the notifyAllocationFailure not
to take effect.

After the introduction of FLINK-13184
<https://jira.apache.org/jira/browse/FLINK-13184> to start  the container
with multi-threaded, the JM disconnection situation has been alleviated. In
order to stably implement the client immediate exit, we use the following
code to determine  whether call onFatalError when
MaximumFailedTaskManagerExceedingException is occurd:

@Override
public void notifyAllocationFailure(JobID jobId, AllocationID
allocationId, Exception cause) {
   validateRunsInMainThread();

   JobManagerRegistration jobManagerRegistration =
jobManagerRegistrations.get(jobId);
   if (jobManagerRegistration != null) {
      jobManagerRegistration.getJobManagerGateway().notifyAllocationFailure(allocationId,
cause);
   } else {
      if (exitProcessOnJobManagerTimedout) {
         ResourceManagerException exception = new
ResourceManagerException("Job Manager is lost, can not notify
allocation failure.");
         onFatalError(exception);
      }
   }
}


Best regards,

Anyang

Mime
View raw message