Date: Sun, 23 Sep 2018 07:06:00 +0000 (UTC)
From: "Holden Karau's magical unicorn (JIRA)"
To: commits@airflow.incubator.apache.org
Reply-To: dev@airflow.incubator.apache.org
Subject: [jira] [Assigned] (AIRFLOW-3046) ECS Operator mistakenly reports success when task is killed due to EC2 host termination

     [ https://issues.apache.org/jira/browse/AIRFLOW-3046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Holden Karau's magical unicorn reassigned AIRFLOW-3046:
-------------------------------------------------------

    Assignee: Holden Karau's magical unicorn

> ECS Operator mistakenly reports success when task is killed due to EC2 host termination
> ---------------------------------------------------------------------------------------
>
>                 Key: AIRFLOW-3046
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-3046
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: contrib, operators
>            Reporter: Dan MacTough
>            Assignee: Holden Karau's magical unicorn
>            Priority: Major
>
> We have ECS clusters made up of EC2 spot fleets. Among other things, this means hosts can be terminated on short notice. When this happens, all tasks (and associated containers) get terminated, as well.
> We expect that when that happens for Airflow task instances using the ECS Operator, those instances will be marked as failures and retried.
> Instead, they are marked as successful.
> As a result, the immediate downstream task fails, causing the scheduled DAG run to fail.
> Here's an example of the Airflow log output when this happens:
> {noformat}
> [2018-09-12 01:02:02,712] {ecs_operator.py:112} INFO - ECS Task stopped, check status: {'tasks': [{'taskArn': 'arn:aws:ecs:us-east-1:111111111111:task/32d43a1d-fbc7-4659-815d-9133bde11cdc', 'clusterArn': 'arn:aws:ecs:us-east-1:111111111111:cluster/processing', 'taskDefinitionArn': 'arn:aws:ecs:us-east-1:111111111111:task-definition/foobar-testing_dataEngineering_rd:76', 'containerInstanceArn': 'arn:aws:ecs:us-east-1:111111111111:container-instance/7431f0a6-8fc5-4eff-8196-32f77d286a61', 'overrides': {'containerOverrides': [{'name': 'foobar-testing', 'command': ['./bin/generate-features.sh', '2018-09-11']}]}, 'lastStatus': 'STOPPED', 'desiredStatus': 'STOPPED', 'cpu': '4096', 'memory': '60000', 'containers': [{'containerArn': 'arn:aws:ecs:us-east-1:111111111111:container/0d5cc553-f894-4f9a-b17c-9f80f7ce8d0a', 'taskArn': 'arn:aws:ecs:us-east-1:111111111111:task/32d43a1d-fbc7-4659-815d-9133bde11cdc', 'name': 'foobar-testing', 'lastStatus': 'RUNNING', 'networkBindings': [], 'networkInterfaces': [], 'healthStatus': 'UNKNOWN'}], 'startedBy': 'Airflow', 'version': 3, 'stoppedReason': 'Host EC2 (instance i-02cf23bbd5ae26194) terminated.', 'connectivity': 'CONNECTED', 'connectivityAt': datetime.datetime(2018, 9, 12, 0, 6, 30, 245000, tzinfo=tzlocal()), 'pullStartedAt': datetime.datetime(2018, 9, 12, 0, 6, 32, 748000, tzinfo=tzlocal()), 'pullStoppedAt': datetime.datetime(2018, 9, 12, 0, 6, 59, 748000, tzinfo=tzlocal()), 'createdAt': datetime.datetime(2018, 9, 12, 0, 6, 30, 245000, tzinfo=tzlocal()), 'startedAt': datetime.datetime(2018, 9, 12, 0, 7, 0, 748000, tzinfo=tzlocal()), 'stoppingAt': datetime.datetime(2018, 9, 12, 1, 2, 0, 91000, tzinfo=tzlocal()), 'stoppedAt': datetime.datetime(2018, 9, 12, 1, 2, 0, 91000, tzinfo=tzlocal()), 'group': 'family:foobar-testing_dataEngineering_rd', 'launchType': 'EC2', 'attachments': [], 'healthStatus': 'UNKNOWN'}], 'failures': [], 'ResponseMetadata': {'RequestId': '758c791f-b627-11e8-83f7-2b76f4796ed2', 'HTTPStatusCode': 200, 'HTTPHeaders': {'server': 'Server', 'date': 'Wed, 12 Sep 2018 01:02:02 GMT', 'content-type': 'application/x-amz-json-1.1', 'content-length': '1412', 'connection': 'keep-alive', 'x-amzn-requestid': '758c791f-b627-11e8-83f7-2b76f4796ed2'}, 'RetryAttempts': 0}}{noformat}
> I believe the function that checks whether the task is successful needs at least one more check.
> We are currently running a modified version of the ECS Operator that contains the following {{_check_success_task}} function to address this failure condition:
> {code}
> def _check_success_task(self):
>     response = self.client.describe_tasks(
>         cluster=self.cluster,
>         tasks=[self.arn]
>     )
>     self.log.info('ECS Task stopped, check status: %s', response)
>     if len(response.get('failures', [])) > 0:
>         raise AirflowException(response)
>     for task in response['tasks']:
>         # Added check: the task stopped because its host instance was terminated.
>         if 'terminated' in task.get('stoppedReason', '').lower():
>             raise AirflowException('The task was stopped because the host instance terminated: {}'.format(
>                 task.get('stoppedReason', '')))
>         containers = task['containers']
>         for container in containers:
>             if container.get('lastStatus') == 'STOPPED' and \
>                     container['exitCode'] != 0:
>                 raise AirflowException(
>                     'This task is not in success state {}'.format(task))
>             elif container.get('lastStatus') == 'PENDING':
>                 raise AirflowException(
>                     'This task is still pending {}'.format(task))
>             elif 'error' in container.get('reason', '').lower():
>                 raise AirflowException(
>                     'This containers encounter an error during launching : {}'.
>                     format(container.get('reason', '').lower()))
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
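
As a rough illustration of why a task-level check is needed, the sketch below applies the same kind of checks to a trimmed copy of the logged describe_tasks response. In that response the container still reports 'RUNNING' and carries no exit code, so container-level checks alone find nothing wrong; only 'stoppedReason' reveals the host termination. The helper name find_failure_reason and the trimmed dict are assumptions made for illustration and are not part of Airflow or of the reporter's patch.

```python
def find_failure_reason(response):
    """Return a short failure description, or None if nothing looks wrong.

    `response` is a dict shaped like the describe_tasks output in the log.
    Hypothetical helper for illustration only; not part of Airflow.
    """
    if response.get('failures'):
        return 'describe_tasks reported failures'
    for task in response.get('tasks', []):
        stopped_reason = task.get('stoppedReason', '')
        # Task-level check: if the host was terminated, the container
        # statuses below may never have been updated.
        if 'terminated' in stopped_reason.lower():
            return 'host terminated: ' + stopped_reason
        for container in task.get('containers', []):
            status = container.get('lastStatus')
            # A missing exitCode is treated as a failure here (an assumption).
            if status == 'STOPPED' and container.get('exitCode') != 0:
                return 'container stopped with a non-zero or missing exit code'
            if status == 'PENDING':
                return 'container still pending'
    return None


# Trimmed copy of the logged response: task STOPPED, container still RUNNING.
logged = {
    'tasks': [{
        'lastStatus': 'STOPPED',
        'stoppedReason': 'Host EC2 (instance i-02cf23bbd5ae26194) terminated.',
        'containers': [{'name': 'foobar-testing', 'lastStatus': 'RUNNING'}],
    }],
    'failures': [],
}

print(find_failure_reason(logged))
```

Returning a reason string instead of raising keeps the sketch free of Airflow imports; inside an operator, the same conditions would raise AirflowException as in the function quoted in the report.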