airflow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ash Berlin-Taylor <...@apache.org>
Subject Re: Database referral integrity
Date Wed, 10 Apr 2019 17:55:46 GMT
I am all for FKs.

How do you think we should handle the case of "Xcom but missing TIs" (or similar) that we
might run into on installs when a user runs `airflow upgradedb`?

-a

> On 10 Apr 2019, at 18:44, Driesprong, Fokko <fokko@driesprong.frl> wrote:
> 
> Reviving this discussion again :-)
> 
> Lately, I was digging into the PR of Julian regarding adding FK's to the
> database: https://github.com/apache/airflow/pull/4922
> 
> After digging into the details, I've noticed that the current situation
> regarding referential integrity is bad. It is not uncommon the have
> DagRun's without having the DAG in the database. For example, you can do a
> backfill job before the scheduler persisted the DAG in the database. I also
> think this is often the case in the UI, where we the nuke when some of the
> models haven't been persisted in the database. Therefore I'd like to
> suggest to enforce consistency by foreign keys. This will prevent us from
> having DagRuns without DAGs, but also removing xcom objects of
> TaskInstances that are already removed.
> 
> To create an overview of the FK's, I've created subtasks the ticket of
> Peter: https://jira.apache.org/jira/browse/AIRFLOW-3904
> 
> WDYT?
> 
> Cheers, Fokko
> 
> 
> 
> Op di 18 sep. 2018 om 21:51 schreef Maxime Beauchemin <
> maximebeauchemin@gmail.com>:
> 
>> The database migration creating the FK will/would need to have something
>> that either creates dummy missing PKs first, or delete the orphaned keys to
>> insure the operation of creating the FK doesn't error out. Seems like
>> adding dummy keys is a better approach. Then you'll have to make sure that
>> everywhere where FKs are created that there are no edge cases of missing
>> PKs. Then some delete operations in some cases may have to "cascade"
>> properly.
>> 
>> The Django Admin had this nice confirm screen on delete that would show you
>> clearly the scope of the cascading operation when deleting objects. To my
>> knowledge Flask-Admin and FAB don't have such a feature. Personally I
>> wouldn't allow cascade unless such a feature is implemented in some way.
>> Note that SQLAlchemy has builtin semantics for specifying how/whether
>> cascading should take place.
>> 
>> Personally I think referential integrity is overrated in some cases,
>> especially when using meaningful "business keys" (as opposed to
>> auto-increted "surrogate" keys) as PKs. It also slows down insert
>> operations. For data warehousing (this clearly doesn't apply to the Airflow
>> metadata store), the best practice on most db engine is to **not** enforce
>> FK constraints as it slows down inserts in fact tables and straight out
>> prevents bulk loading.
>> 
>> Another approach is to avoid deleting in general, especially referenced
>> fks, and setting up some activity/visibility flag to false instead.
>> 
>> Max
>> 
>> On Tue, Sep 18, 2018 at 10:47 AM Driesprong, Fokko <fokko@driesprong.frl>
>> wrote:
>> 
>>> I'm in favor of having referential integrity. It will add some load in
>>> having to enforce the referential integrity, but it will also make sure
>>> that the database stays clean. Also in Airflow we use transactions which
>>> will make sure that the integrity checks are not validated on every
>>> statement, but after the commit. I'm happy to help with this as well.
>>> 
>>> Cheers, Fokko
>>> 
>>> Op di 18 sep. 2018 om 11:07 schreef Bolke de Bruin <bdbruin@gmail.com>:
>>> 
>>>> Adding these kind of checks which work for integrity well make database
>>>> access pretty slow. In addition it isnt there because in the past there
>>> was
>>>> no strong connection between for example tasks and dagruns, it was more
>>> or
>>>> less just coincidental. There also so some bisecting tools that
>> probably
>>>> have difficulty functioning in a new regime. In other words it is not
>> an
>>>> easy change and it will have operational challenges.
>>>> 
>>>>> On 18 Sep 2018, at 11:03, Ash Berlin-Taylor <ash@apache.org> wrote:
>>>>> 
>>>>> Ooh good spot.
>>>>> 
>>>>> Yes I would be in favour of adding these, but as you say we need to
>>>> thing about how we might migrate old data.
>>>>> 
>>>>> Doing this at 2.0.0 and providing a cleanup script (or doing it as
>> part
>>>> of the migration?) is probably the way to go.
>>>>> 
>>>>> -ash-
>>>>> 
>>>>>> On 17 Sep 2018, at 19:56, Stefan Seelmann <mail@stefan-seelmann.de>
>>>> wrote:
>>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> looking into the DB schema there is almost no referral integrity
>>>>>> enforced at the database level. Many foreign key constraints between
>>>>>> dag, dag_run, task_instance, xcom, dag_pickle, log, etc would make
>>> sense
>>>>>> IMO.
>>>>>> 
>>>>>> Is there a particular reason why that's not implemented?
>>>>>> 
>>>>>> Introducing it now will be hard, probably any real-world setup has
>>> some
>>>>>> violations. But I'm still in favor of this additional safety net.
>>>>>> 
>>>>>> Kind Regards,
>>>>>> Stefan
>>>>> 
>>>> 
>>>> 
>>> 
>> 


Mime
View raw message