airflow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alex Guziel <alex.guz...@airbnb.com.INVALID>
Subject Re: Database referral integrity
Date Wed, 10 Apr 2019 21:49:29 GMT
Kevin Yang has done some benchmarking as well.

On Wed, Apr 10, 2019 at 2:32 PM Gabriel Silk <gsilk@dropbox.com.invalid>
wrote:

> The main concern I have with FK's is the potential performance impact.
>
> When evaluating Airflow for use at Dropbox, I ran benchmarks that showed
> several bottlenecks in the query patterns being used by the scheduler,
> which led me to open https://issues.apache.org/jira/browse/AIRFLOW-2430
> and
> submit a PR.
>
> If we move forward with FKs, we need to do a robust set of benchmarks to
> understand how this will affect users who are running large Airflow
> clusters.
>
> On Wed, Apr 10, 2019 at 1:05 PM Alex Guziel <alex.guziel@airbnb.com
> .invalid>
> wrote:
>
> > I'm not a huge fan of having foreign keys. I know Airbnb has and
> definitely
> > still has problems with DB load. I don't see any real convincing
> arguments
> > for how adding referential integrity will improve Airflow meaningfully
> > (yet).
> >
> > On Wed, Apr 10, 2019 at 12:38 PM Bas Harenslak <
> > basharenslak@godatadriven.com> wrote:
> >
> > > In my experience it could go either way; in some cases FKs could impair
> > > the performance and in some cases FKs can help the query optimizer
> > improve
> > > query performance. Each case is different and without testing it’s just
> > > guessing.
> > >
> > > I’m in favour of adding FKs and value referential integrity over
> > > performance. If you’re sacrificing integrity for performance you’re
> doing
> > > either advanced funky stuff or the wrong thing. I haven’t seen the
> > database
> > > being a bottleneck in Airflow, even with large setups (+-5000 DAGs). So
> > why
> > > not add FKs and performance test some Airflow queries, to know for sure
> > :-)
> > >
> > > Bas
> > >
> > > On 10 Apr 2019, at 20:39, Bolke de Bruin <bdbruin@gmail.com<mailto:
> > > bdbruin@gmail.com>> wrote:
> > >
> > > Please not that Fks can be quite slow...
> > >
> > > Verstuurd vanaf mijn iPad
> > >
> > > Op 10 apr. 2019 om 19:55 heeft Ash Berlin-Taylor <ash@apache.org
> <mailto:
> > > ash@apache.org>> het volgende geschreven:
> > >
> > > I am all for FKs.
> > >
> > > How do you think we should handle the case of "Xcom but missing TIs"
> (or
> > > similar) that we might run into on installs when a user runs `airflow
> > > upgradedb`?
> > >
> > > -a
> > >
> > > On 10 Apr 2019, at 18:44, Driesprong, Fokko <fokko@driesprong.frl
> > <mailto:
> > > fokko@driesprong.frl>> wrote:
> > >
> > > Reviving this discussion again :-)
> > >
> > > Lately, I was digging into the PR of Julian regarding adding FK's to
> the
> > > database: https://github.com/apache/airflow/pull/4922
> > >
> > > After digging into the details, I've noticed that the current situation
> > > regarding referential integrity is bad. It is not uncommon the have
> > > DagRun's without having the DAG in the database. For example, you can
> do
> > a
> > > backfill job before the scheduler persisted the DAG in the database. I
> > also
> > > think this is often the case in the UI, where we the nuke when some of
> > the
> > > models haven't been persisted in the database. Therefore I'd like to
> > > suggest to enforce consistency by foreign keys. This will prevent us
> from
> > > having DagRuns without DAGs, but also removing xcom objects of
> > > TaskInstances that are already removed.
> > >
> > > To create an overview of the FK's, I've created subtasks the ticket of
> > > Peter: https://jira.apache.org/jira/browse/AIRFLOW-3904
> > >
> > > WDYT?
> > >
> > > Cheers, Fokko
> > >
> > >
> > >
> > > Op di 18 sep. 2018 om 21:51 schreef Maxime Beauchemin <
> > > maximebeauchemin@gmail.com<mailto:maximebeauchemin@gmail.com>>:
> > >
> > > The database migration creating the FK will/would need to have
> something
> > > that either creates dummy missing PKs first, or delete the orphaned
> keys
> > to
> > > insure the operation of creating the FK doesn't error out. Seems like
> > > adding dummy keys is a better approach. Then you'll have to make sure
> > that
> > > everywhere where FKs are created that there are no edge cases of
> missing
> > > PKs. Then some delete operations in some cases may have to "cascade"
> > > properly.
> > >
> > > The Django Admin had this nice confirm screen on delete that would show
> > you
> > > clearly the scope of the cascading operation when deleting objects. To
> my
> > > knowledge Flask-Admin and FAB don't have such a feature. Personally I
> > > wouldn't allow cascade unless such a feature is implemented in some
> way.
> > > Note that SQLAlchemy has builtin semantics for specifying how/whether
> > > cascading should take place.
> > >
> > > Personally I think referential integrity is overrated in some cases,
> > > especially when using meaningful "business keys" (as opposed to
> > > auto-increted "surrogate" keys) as PKs. It also slows down insert
> > > operations. For data warehousing (this clearly doesn't apply to the
> > Airflow
> > > metadata store), the best practice on most db engine is to **not**
> > enforce
> > > FK constraints as it slows down inserts in fact tables and straight out
> > > prevents bulk loading.
> > >
> > > Another approach is to avoid deleting in general, especially referenced
> > > fks, and setting up some activity/visibility flag to false instead.
> > >
> > > Max
> > >
> > > On Tue, Sep 18, 2018 at 10:47 AM Driesprong, Fokko
> <fokko@driesprong.frl
> > > <mailto:fokko@driesprong.frl>>
> > > wrote:
> > >
> > > I'm in favor of having referential integrity. It will add some load in
> > > having to enforce the referential integrity, but it will also make sure
> > > that the database stays clean. Also in Airflow we use transactions
> which
> > > will make sure that the integrity checks are not validated on every
> > > statement, but after the commit. I'm happy to help with this as well.
> > >
> > > Cheers, Fokko
> > >
> > > Op di 18 sep. 2018 om 11:07 schreef Bolke de Bruin <bdbruin@gmail.com
> > > <mailto:bdbruin@gmail.com>>:
> > >
> > > Adding these kind of checks which work for integrity well make database
> > > access pretty slow. In addition it isnt there because in the past there
> > > was
> > > no strong connection between for example tasks and dagruns, it was more
> > > or
> > > less just coincidental. There also so some bisecting tools that
> > > probably
> > > have difficulty functioning in a new regime. In other words it is not
> > > an
> > > easy change and it will have operational challenges.
> > >
> > > On 18 Sep 2018, at 11:03, Ash Berlin-Taylor <ash@apache.org<mailto:
> > > ash@apache.org>> wrote:
> > >
> > > Ooh good spot.
> > >
> > > Yes I would be in favour of adding these, but as you say we need to
> > > thing about how we might migrate old data.
> > >
> > > Doing this at 2.0.0 and providing a cleanup script (or doing it as
> > > part
> > > of the migration?) is probably the way to go.
> > >
> > > -ash-
> > >
> > > On 17 Sep 2018, at 19:56, Stefan Seelmann <mail@stefan-seelmann.de
> > <mailto:
> > > mail@stefan-seelmann.de>>
> > > wrote:
> > >
> > > Hi,
> > >
> > > looking into the DB schema there is almost no referral integrity
> > > enforced at the database level. Many foreign key constraints between
> > > dag, dag_run, task_instance, xcom, dag_pickle, log, etc would make
> > > sense
> > > IMO.
> > >
> > > Is there a particular reason why that's not implemented?
> > >
> > > Introducing it now will be hard, probably any real-world setup has
> > > some
> > > violations. But I'm still in favor of this additional safety net.
> > >
> > > Kind Regards,
> > > Stefan
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message