From: GitBox
To: commits@airflow.apache.org
Subject: [GitHub] [airflow] kaxil commented on a change in pull request #5743: [AIRFLOW-5088][AIP-24] Persisting serialized DAG in DB for webserver scalability
Date: Wed, 23 Oct 2019 00:08:28 -0000
Reply-To: dev@airflow.apache.org

kaxil commented on a change in pull request #5743: [AIRFLOW-5088][AIP-24] Persisting serialized DAG in DB for webserver scalability
URL: https://github.com/apache/airflow/pull/5743#discussion_r337801737

##########
File path: docs/howto/enable-dag-serialization.rst
##########

@@ -0,0 +1,109 @@

 .. Licensed to the Apache Software Foundation (ASF) under one
    or more contributor license agreements.  See the NOTICE file
    distributed with this work for additional information
    regarding copyright ownership.  The ASF licenses this file
    to you under the Apache License, Version 2.0 (the
    "License"); you may not use this file except in compliance
    with the License.
    You may obtain a copy of the License at

 .. http://www.apache.org/licenses/LICENSE-2.0

 .. Unless required by applicable law or agreed to in writing,
    software distributed under the License is distributed on an
    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    KIND, either express or implied.  See the License for the
    specific language governing permissions and limitations
    under the License.


Enable DAG Serialization
========================

Add the following settings in ``airflow.cfg``:

.. code-block:: ini

    [core]
    store_serialized_dags = True
    min_serialized_dag_update_interval = 30

* ``store_serialized_dags``: This flag decides whether to serialize DAGs and persist them in the DB.
  If set to ``True``, the Webserver reads from the DB instead of parsing DAG files.
* ``min_serialized_dag_update_interval``: This flag sets the minimum interval (in seconds) after which
  the serialized DAGs in the DB should be updated. This helps in reducing the database write rate.

If you are updating Airflow from a version earlier than 1.10.6, do not forget to run ``airflow db upgrade``.


How it works
------------

In order to make the Airflow Webserver stateless (almost!), Airflow >= 1.10.6 supports
DAG Serialization and DB Persistence.

.. image:: ../img/dag_serialization.png

As shown in the image above, in vanilla Airflow the Webserver and the Scheduler both
need access to the DAG files, and both parse them.

With **DAG Serialization** we aim to decouple the Webserver from DAG parsing,
which makes the Webserver very light-weight.

As shown in the image above, when using the DAG Serialization feature,
the Scheduler parses the DAG files, serializes them in JSON format and saves them in the Metadata DB.

Instead of having to parse the DAG files again, the Webserver reads the
serialized DAGs in JSON, de-serializes them, creates the DagBag and uses it
to show the DAGs in the UI.
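The serialize-to-DB / read-on-demand flow described above can be sketched as follows. This is a deliberately simplified, self-contained illustration, not Airflow's actual ``SerializedDagModel`` code: the DAG dictionary, table schema, and column names here are stand-ins, and the real serialized format carries far more fields.

```python
import json
import sqlite3

# Hypothetical, simplified stand-in for a parsed DAG; the real
# serialized representation contains many more attributes.
dag = {
    "dag_id": "example_dag",
    "schedule_interval": "@daily",
    "tasks": [{"task_id": "extract"}, {"task_id": "load"}],
}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE serialized_dag (dag_id TEXT PRIMARY KEY, data TEXT)")

# Scheduler side: serialize the parsed DAG to JSON and upsert it into the DB.
conn.execute(
    "INSERT OR REPLACE INTO serialized_dag VALUES (?, ?)",
    (dag["dag_id"], json.dumps(dag)),
)

# Webserver side: read the JSON back and de-serialize it on demand,
# without ever touching the original DAG file.
row = conn.execute(
    "SELECT data FROM serialized_dag WHERE dag_id = ?", ("example_dag",)
).fetchone()
restored = json.loads(row[0])
print(restored["tasks"])  # [{'task_id': 'extract'}, {'task_id': 'load'}]
```

The key property this models is that the webserver's read path depends only on the database row, which is what allows each DAG to be loaded lazily rather than parsing the whole DagBag at startup.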
One of the key features implemented as part of DAG Serialization is that,
instead of loading an entire DagBag when the Webserver starts, we only load each DAG on demand from the
serialized DAG table. This helps reduce the Webserver startup time and memory usage. The reduction is notable
when you have a large number of DAGs.

Below is a screenshot of the ``serialized_dag`` table in the Metadata DB:

.. image:: ../img/serialized_dag_table.png

Limitations
-----------
The Webserver will still need access to DAG files in the following cases,
which is why we said "almost" stateless.

* The **Rendered Template** tab will still have to parse the Python file, as it needs all the details, like
  the execution date and even the data passed by the upstream task using XCom.
* **Code View** will read the DAG file and show it using Pygments.
  However, it does not need to parse the Python file, so it is still a small operation.
* :doc:`Extra Operator Links ` would not work out of
  the box. They need to be defined in an Airflow Plugin.

  **Existing Airflow Operators**:
  To make extra operator links work with existing operators like BigQuery, copy all
  the classes that are defined in the ``operator_extra_links`` property.

Review comment:
   Looks like the currently defined OperatorLinks for in-built operators are dependent on the instance of the operator object. For example, the following is from the Qubole Operator:
   ```python
   class QDSLink(BaseOperatorLink):
       """Link to QDS"""
       name = 'Go to QDS'

       def get_link(self, operator, dttm):
           return operator.get_hook().get_extra_links(operator, dttm)
   ```
   which depends on the `conn = BaseHook.get_connection(operator.kwargs['qubole_conn_id'])` line. We don't store `qubole_conn_id` as a task property in our serialized DAG.
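   The failure mode being described can be sketched with a toy serializer. This is not Airflow's real code: `BigQueryOperatorStub` and `SERIALIZED_FIELDS` are hypothetical names, but they mirror the underlying issue, namely that only a whitelist of task properties is persisted, so an operator link that reads an instance attribute like `sql` finds nothing after de-serialization.
   ```python
   import json

   class BigQueryOperatorStub:
       """Hypothetical stand-in for an operator whose extra link reads self.sql."""
       def __init__(self, task_id, sql):
           self.task_id = task_id
           self.sql = sql

   # Only a whitelist of fields is persisted in the serialized DAG;
   # 'sql' is not one of them, mimicking the behaviour discussed above.
   SERIALIZED_FIELDS = {"task_id"}

   def serialize(op):
       return json.dumps({f: getattr(op, f) for f in SERIALIZED_FIELDS})

   def deserialize(blob):
       # Rebuild the operator without calling __init__, from stored fields only.
       op = BigQueryOperatorStub.__new__(BigQueryOperatorStub)
       op.__dict__.update(json.loads(blob))
       return op

   op = BigQueryOperatorStub("run_query", sql="SELECT 1")
   restored = deserialize(serialize(op))

   # An extra link's get_link() would try to read restored.sql and fail:
   print(hasattr(restored, "sql"))  # False
   ```
   The same sketch applies to `qubole_conn_id` living in `operator.kwargs`: any attribute outside the persisted whitelist is simply absent on the restored object.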
   and the BigQueryOperator's is:
   ```python
   @property
   def operator_extra_links(self):
       """
       Return operator extra links
       """
       if isinstance(self.sql, str):
           return (
               BigQueryConsoleLink(),
           )
       return (
           BigQueryConsoleIndexableLink(i) for i, _ in enumerate(self.sql)
       )
   ```
   This one depends on the `self.sql` property, which the serialized operator won't have access to, as we don't store that info in the serialized DAG.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

With regards,
Apache Git Services