From: Anyang Hu
Date: Sun, 8 Sep 2019 17:25:49 +0800
Subject: Re: suggestion of FLINK-10868
To: Peter Huang
Cc: Till Rohrmann, user, qi luo, snake.fly318@gmail.com

Hi Peter,

For our online batch tasks, there is a scenario where the container failure rate reaches MAXIMUM_WORKERS_FAILURE_RATE but the client does not exit immediately (the probability of JM loss increases greatly when thousands of containers are to be started). We found that a JM disconnection (the reason for the JM loss is unknown) causes notifyAllocationFailure not to take effect.

After FLINK-13184 was introduced to start containers with multiple threads, the JM disconnection situation has been alleviated. To make the client exit immediately in a stable way, we use the following code to decide whether to call onFatalError when MaximumFailedTaskManagerExceedingException occurs:

    @Override
    public void notifyAllocationFailure(JobID jobId, AllocationID allocationId, Exception cause) {
        validateRunsInMainThread();

        JobManagerRegistration jobManagerRegistration = jobManagerRegistrations.get(jobId);
        if (jobManagerRegistration != null) {
            // The JM is still registered: forward the allocation failure to it.
            jobManagerRegistration.getJobManagerGateway().notifyAllocationFailure(allocationId, cause);
        } else if (exitProcessOnJobManagerTimedout) {
            // The JM is lost, so the failure cannot be delivered: fail fatally instead.
            ResourceManagerException exception = new ResourceManagerException(
                "Job Manager is lost, can not notify allocation failure.");
            onFatalError(exception);
        }
    }

Best regards,
Anyang