From: Maulik Soneji <maulik.soneji@go-jek.com>
Date: Mon, 12 Aug 2019 12:02:23 +0530
Subject: Facing dag_id not found when syncing dags zip from GCS
To: dev@airflow.apache.org
Hello everyone,

We are using Airflow to schedule our batch processing jobs, and it has been a great tool for scheduling our DAGs.

Recently we started packaging our DAGs along with their dependencies into a zip. This zip is uploaded to GCS; we then sync the DAGs from GCS and mount them into a PersistentVolumeClaim (PVC) on Kubernetes. Since the DAGs have to be synced from GCS to Airflow, we have tried a few strategies, but nothing has worked reliably.

We are currently running *airflow version 1.10.2*. As the setup involves multiple components, I have split it into steps.

*Step 1*. Syncing dags using gcsfuse

We bundle the dependencies along with the DAGs and upload the zip to GCS. gcsfuse (https://github.com/GoogleCloudPlatform/gcsfuse) is used to sync the DAGs into the dags_folder configured for Airflow. We have created an airflow user with UID and GID 1000, and the dags folder is owned by that user. We run the following command to mount the DAGs:

gcsfuse --uid=1000 --gid=1000 --key-file my-secret.json /home/airflow/airflow/dags

*Step 2*. Mounting the dags to the PVC

The dags folder cannot be mounted on our PVC directly, possibly because it was created by the airflow user while the PVC expects it to be owned by root. So, in order to mount the folder to the PVC, we periodically (every minute) copy the contents of the dags folder to a dagsmount folder using rsync:

rsync -r --delete dags/ dagsmount/

The dagsmount folder is owned by root, so it can be mounted on the PVC while the dags folder cannot.

With this setup, we intermittently get either "dag_id not found" or "zip file doesn't exist" errors.
Our current assumption is that, because we are continuously syncing the DAGs, reads and writes happen on the same files in parallel, and this is what causes the errors.

Here is a sample stack trace for these errors:

[2019-08-10 12:00:36,754] {models.py:412} ERROR - Failed to import: /home/airflow/airflow/dags/packaged_dags.zip
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/airflow/models.py", line 409, in process_file
    m = importlib.import_module(mod_name)
  File "/usr/local/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 963, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 906, in _find_spec
  File "<frozen importlib._bootstrap_external>", line 1280, in find_spec
  File "<frozen importlib._bootstrap_external>", line 1254, in _get_spec
  File "<frozen importlib._bootstrap_external>", line 1235, in _legacy_get_spec
  File "<frozen importlib._bootstrap>", line 441, in spec_from_loader
  File "<frozen importlib._bootstrap_external>", line 594, in spec_from_file_location
FileNotFoundError: [Errno 2] No such file or directory: '/home/airflow/airflow/dags/packaged_dags.zip'
DEBUG - Calling callbacks: []
INFO - Job 3063: Subtask export_bigquery_wallet_table /usr/local/lib/python3.7/site-packages/airflow/utils/helpers.py:356: DeprecationWarning: Importing 'BashOperator' directly from 'airflow.operators' has been deprecated. Please import from 'airflow.operators.[operator_module]' instead. Support for direct imports will be dropped entirely in Airflow 2.0.
INFO - Job 3063: Subtask export_bigquery_wallet_table   DeprecationWarning)
INFO - Job 3063: Subtask export_bigquery_wallet_table /usr/local/lib/python3.7/site-packages/airflow/contrib/operators/bigquery_operator.py:172: DeprecationWarning: Deprecated parameter `bql` used in Task id: export_bigquery_brand_table. Use `sql` parameter instead to pass the sql to be executed. `bql` parameter is deprecated and will be removed in a future version of Airflow.
INFO - Job 3063: Subtask export_bigquery_wallet_table   category=DeprecationWarning)
INFO - Job 3063: Subtask export_bigquery_wallet_table /usr/local/lib/python3.7/site-packages/airflow/contrib/operators/bigquery_operator.py:172: DeprecationWarning: Deprecated parameter `bql` used in Task id: export_bigquery_brand_peak_order_time_table. Use `sql` parameter instead to pass the sql to be executed. `bql` parameter is deprecated and will be removed in a future version of Airflow.
INFO - Job 3063: Subtask export_bigquery_wallet_table   category=DeprecationWarning)
INFO - Job 3063: Subtask export_bigquery_wallet_table [2019-08-10 12:00:36,760] {settings.py:201} DEBUG - Disposing DB connection pool (PID 37)
INFO - Job 3063: Subtask export_bigquery_wallet_table Traceback (most recent call last):
INFO - Job 3063: Subtask export_bigquery_wallet_table   File "/usr/local/bin/airflow", line 32, in <module>
INFO - Job 3063: Subtask export_bigquery_wallet_table     args.func(args)
INFO - Job 3063: Subtask export_bigquery_wallet_table   File "/usr/local/lib/python3.7/site-packages/airflow/utils/cli.py", line 74, in wrapper
INFO - Job 3063: Subtask export_bigquery_wallet_table     return f(*args, **kwargs)
INFO - Job 3063: Subtask export_bigquery_wallet_table   File "/usr/local/lib/python3.7/site-packages/airflow/bin/cli.py", line 504, in run
INFO - Job 3063: Subtask export_bigquery_wallet_table     dag = get_dag(args)
INFO - Job 3063: Subtask export_bigquery_wallet_table   File "/usr/local/lib/python3.7/site-packages/airflow/bin/cli.py", line 149, in get_dag
INFO - Job 3063: Subtask export_bigquery_wallet_table     'parse.'.format(args.dag_id))
INFO - Job 3063: Subtask export_bigquery_wallet_table airflow.exceptions.AirflowException: dag_id could not be found: customer_summary_integration.customer_profile. Either the dag did not exist or it failed to parse.

I have tried to cover all the details; let me know if anything is unclear.

--
Thanks and Regards,
Maulik Soneji