From dev-return-8033-archive-asf-public=cust-asf.ponee.io@airflow.apache.org Wed Apr 10 17:56:29 2019 Return-Path: X-Original-To: archive-asf-public@cust-asf.ponee.io Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [207.244.88.153]) by mx-eu-01.ponee.io (Postfix) with SMTP id 4B3B3180626 for ; Wed, 10 Apr 2019 19:56:29 +0200 (CEST) Received: (qmail 97753 invoked by uid 500); 10 Apr 2019 17:55:49 -0000 Mailing-List: contact dev-help@airflow.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@airflow.apache.org Delivered-To: mailing list dev@airflow.apache.org Received: (qmail 97739 invoked by uid 99); 10 Apr 2019 17:55:48 -0000 Received: from Unknown (HELO mailrelay2-lw-us.apache.org) (10.10.3.159) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 10 Apr 2019 17:55:48 +0000 Received: from themisto.localdomain (231.25.169.217.in-addr.arpa [217.169.25.231]) by mailrelay2-lw-us.apache.org (ASF Mail Server at mailrelay2-lw-us.apache.org) with ESMTPSA id 57E962FCD for ; Wed, 10 Apr 2019 17:55:48 +0000 (UTC) From: Ash Berlin-Taylor Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: quoted-printable Mime-Version: 1.0 (Mac OS X Mail 10.3 \(3273\)) Subject: Re: Database referral integrity Date: Wed, 10 Apr 2019 18:55:46 +0100 References: <001E4B92-0B61-418B-B57D-2755A6404AB0@gmail.com> To: dev@airflow.apache.org In-Reply-To: Message-Id: <64BF0426-1252-4384-AFC6-B7EDC188E929@apache.org> X-Mailer: Apple Mail (2.3273) I am all for FKs. How do you think we should handle the case of "Xcom but missing TIs" (or = similar) that we might run into on installs when a user runs `airflow = upgradedb`? -a > On 10 Apr 2019, at 18:44, Driesprong, Fokko = wrote: >=20 > Reviving this discussion again :-) >=20 > Lately, I was digging into the PR of Julian regarding adding FK's to = the > database: https://github.com/apache/airflow/pull/4922 >=20 > After digging into the details, I've noticed that the current = situation > regarding referential integrity is bad. It is not uncommon the have > DagRun's without having the DAG in the database. For example, you can = do a > backfill job before the scheduler persisted the DAG in the database. I = also > think this is often the case in the UI, where we the nuke when some of = the > models haven't been persisted in the database. Therefore I'd like to > suggest to enforce consistency by foreign keys. This will prevent us = from > having DagRuns without DAGs, but also removing xcom objects of > TaskInstances that are already removed. >=20 > To create an overview of the FK's, I've created subtasks the ticket of > Peter: https://jira.apache.org/jira/browse/AIRFLOW-3904 >=20 > WDYT? >=20 > Cheers, Fokko >=20 >=20 >=20 > Op di 18 sep. 2018 om 21:51 schreef Maxime Beauchemin < > maximebeauchemin@gmail.com>: >=20 >> The database migration creating the FK will/would need to have = something >> that either creates dummy missing PKs first, or delete the orphaned = keys to >> insure the operation of creating the FK doesn't error out. Seems like >> adding dummy keys is a better approach. Then you'll have to make sure = that >> everywhere where FKs are created that there are no edge cases of = missing >> PKs. Then some delete operations in some cases may have to "cascade" >> properly. >>=20 >> The Django Admin had this nice confirm screen on delete that would = show you >> clearly the scope of the cascading operation when deleting objects. = To my >> knowledge Flask-Admin and FAB don't have such a feature. Personally I >> wouldn't allow cascade unless such a feature is implemented in some = way. >> Note that SQLAlchemy has builtin semantics for specifying how/whether >> cascading should take place. >>=20 >> Personally I think referential integrity is overrated in some cases, >> especially when using meaningful "business keys" (as opposed to >> auto-increted "surrogate" keys) as PKs. It also slows down insert >> operations. For data warehousing (this clearly doesn't apply to the = Airflow >> metadata store), the best practice on most db engine is to **not** = enforce >> FK constraints as it slows down inserts in fact tables and straight = out >> prevents bulk loading. >>=20 >> Another approach is to avoid deleting in general, especially = referenced >> fks, and setting up some activity/visibility flag to false instead. >>=20 >> Max >>=20 >> On Tue, Sep 18, 2018 at 10:47 AM Driesprong, Fokko = >> wrote: >>=20 >>> I'm in favor of having referential integrity. It will add some load = in >>> having to enforce the referential integrity, but it will also make = sure >>> that the database stays clean. Also in Airflow we use transactions = which >>> will make sure that the integrity checks are not validated on every >>> statement, but after the commit. I'm happy to help with this as = well. >>>=20 >>> Cheers, Fokko >>>=20 >>> Op di 18 sep. 2018 om 11:07 schreef Bolke de Bruin = : >>>=20 >>>> Adding these kind of checks which work for integrity well make = database >>>> access pretty slow. In addition it isnt there because in the past = there >>> was >>>> no strong connection between for example tasks and dagruns, it was = more >>> or >>>> less just coincidental. There also so some bisecting tools that >> probably >>>> have difficulty functioning in a new regime. In other words it is = not >> an >>>> easy change and it will have operational challenges. >>>>=20 >>>>> On 18 Sep 2018, at 11:03, Ash Berlin-Taylor = wrote: >>>>>=20 >>>>> Ooh good spot. >>>>>=20 >>>>> Yes I would be in favour of adding these, but as you say we need = to >>>> thing about how we might migrate old data. >>>>>=20 >>>>> Doing this at 2.0.0 and providing a cleanup script (or doing it as >> part >>>> of the migration?) is probably the way to go. >>>>>=20 >>>>> -ash- >>>>>=20 >>>>>> On 17 Sep 2018, at 19:56, Stefan Seelmann = >>>> wrote: >>>>>>=20 >>>>>> Hi, >>>>>>=20 >>>>>> looking into the DB schema there is almost no referral integrity >>>>>> enforced at the database level. Many foreign key constraints = between >>>>>> dag, dag_run, task_instance, xcom, dag_pickle, log, etc would = make >>> sense >>>>>> IMO. >>>>>>=20 >>>>>> Is there a particular reason why that's not implemented? >>>>>>=20 >>>>>> Introducing it now will be hard, probably any real-world setup = has >>> some >>>>>> violations. But I'm still in favor of this additional safety net. >>>>>>=20 >>>>>> Kind Regards, >>>>>> Stefan >>>>>=20 >>>>=20 >>>>=20 >>>=20 >>=20