airflow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alex Guziel <alex.guz...@airbnb.com.INVALID>
Subject Re: Database referral integrity
Date Wed, 10 Apr 2019 20:05:00 GMT
I'm not a huge fan of having foreign keys. I know Airbnb has and definitely
still has problems with DB load. I don't see any real convincing arguments
for how adding referential integrity will improve Airflow meaningfully
(yet).

On Wed, Apr 10, 2019 at 12:38 PM Bas Harenslak <
basharenslak@godatadriven.com> wrote:

> In my experience it could go either way; in some cases FKs could impair
> the performance and in some cases FKs can help the query optimizer improve
> query performance. Each case is different and without testing it’s just
> guessing.
>
> I’m in favour of adding FKs and value referential integrity over
> performance. If you’re sacrificing integrity for performance you’re doing
> either advanced funky stuff or the wrong thing. I haven’t seen the database
> being a bottleneck in Airflow, even with large setups (+-5000 DAGs). So why
> not add FKs and performance test some Airflow queries, to know for sure :-)
>
> Bas
>
> On 10 Apr 2019, at 20:39, Bolke de Bruin <bdbruin@gmail.com<mailto:
> bdbruin@gmail.com>> wrote:
>
> Please not that Fks can be quite slow...
>
> Verstuurd vanaf mijn iPad
>
> Op 10 apr. 2019 om 19:55 heeft Ash Berlin-Taylor <ash@apache.org<mailto:
> ash@apache.org>> het volgende geschreven:
>
> I am all for FKs.
>
> How do you think we should handle the case of "Xcom but missing TIs" (or
> similar) that we might run into on installs when a user runs `airflow
> upgradedb`?
>
> -a
>
> On 10 Apr 2019, at 18:44, Driesprong, Fokko <fokko@driesprong.frl<mailto:
> fokko@driesprong.frl>> wrote:
>
> Reviving this discussion again :-)
>
> Lately, I was digging into the PR of Julian regarding adding FK's to the
> database: https://github.com/apache/airflow/pull/4922
>
> After digging into the details, I've noticed that the current situation
> regarding referential integrity is bad. It is not uncommon the have
> DagRun's without having the DAG in the database. For example, you can do a
> backfill job before the scheduler persisted the DAG in the database. I also
> think this is often the case in the UI, where we the nuke when some of the
> models haven't been persisted in the database. Therefore I'd like to
> suggest to enforce consistency by foreign keys. This will prevent us from
> having DagRuns without DAGs, but also removing xcom objects of
> TaskInstances that are already removed.
>
> To create an overview of the FK's, I've created subtasks the ticket of
> Peter: https://jira.apache.org/jira/browse/AIRFLOW-3904
>
> WDYT?
>
> Cheers, Fokko
>
>
>
> Op di 18 sep. 2018 om 21:51 schreef Maxime Beauchemin <
> maximebeauchemin@gmail.com<mailto:maximebeauchemin@gmail.com>>:
>
> The database migration creating the FK will/would need to have something
> that either creates dummy missing PKs first, or delete the orphaned keys to
> insure the operation of creating the FK doesn't error out. Seems like
> adding dummy keys is a better approach. Then you'll have to make sure that
> everywhere where FKs are created that there are no edge cases of missing
> PKs. Then some delete operations in some cases may have to "cascade"
> properly.
>
> The Django Admin had this nice confirm screen on delete that would show you
> clearly the scope of the cascading operation when deleting objects. To my
> knowledge Flask-Admin and FAB don't have such a feature. Personally I
> wouldn't allow cascade unless such a feature is implemented in some way.
> Note that SQLAlchemy has builtin semantics for specifying how/whether
> cascading should take place.
>
> Personally I think referential integrity is overrated in some cases,
> especially when using meaningful "business keys" (as opposed to
> auto-increted "surrogate" keys) as PKs. It also slows down insert
> operations. For data warehousing (this clearly doesn't apply to the Airflow
> metadata store), the best practice on most db engine is to **not** enforce
> FK constraints as it slows down inserts in fact tables and straight out
> prevents bulk loading.
>
> Another approach is to avoid deleting in general, especially referenced
> fks, and setting up some activity/visibility flag to false instead.
>
> Max
>
> On Tue, Sep 18, 2018 at 10:47 AM Driesprong, Fokko <fokko@driesprong.frl
> <mailto:fokko@driesprong.frl>>
> wrote:
>
> I'm in favor of having referential integrity. It will add some load in
> having to enforce the referential integrity, but it will also make sure
> that the database stays clean. Also in Airflow we use transactions which
> will make sure that the integrity checks are not validated on every
> statement, but after the commit. I'm happy to help with this as well.
>
> Cheers, Fokko
>
> Op di 18 sep. 2018 om 11:07 schreef Bolke de Bruin <bdbruin@gmail.com
> <mailto:bdbruin@gmail.com>>:
>
> Adding these kind of checks which work for integrity well make database
> access pretty slow. In addition it isnt there because in the past there
> was
> no strong connection between for example tasks and dagruns, it was more
> or
> less just coincidental. There also so some bisecting tools that
> probably
> have difficulty functioning in a new regime. In other words it is not
> an
> easy change and it will have operational challenges.
>
> On 18 Sep 2018, at 11:03, Ash Berlin-Taylor <ash@apache.org<mailto:
> ash@apache.org>> wrote:
>
> Ooh good spot.
>
> Yes I would be in favour of adding these, but as you say we need to
> thing about how we might migrate old data.
>
> Doing this at 2.0.0 and providing a cleanup script (or doing it as
> part
> of the migration?) is probably the way to go.
>
> -ash-
>
> On 17 Sep 2018, at 19:56, Stefan Seelmann <mail@stefan-seelmann.de<mailto:
> mail@stefan-seelmann.de>>
> wrote:
>
> Hi,
>
> looking into the DB schema there is almost no referral integrity
> enforced at the database level. Many foreign key constraints between
> dag, dag_run, task_instance, xcom, dag_pickle, log, etc would make
> sense
> IMO.
>
> Is there a particular reason why that's not implemented?
>
> Introducing it now will be hard, probably any real-world setup has
> some
> violations. But I'm still in favor of this additional safety net.
>
> Kind Regards,
> Stefan
>
>
>
>
>
>
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message