airflow-dev mailing list archives

From Bas Harenslak <basharens...@godatadriven.com>
Subject [DISCUSS] AIP-12 Persist DAG into DB
Date Sat, 23 Feb 2019 19:40:52 GMT
Let's discuss AIP-12 here: https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-12+Persist+DAG+into+DB.
It involves persisting the entire DAG into the metastore. For full details, please read the
AIP.

A PR was made to create “versioned graphs” given by option #3 in the AIP: https://github.com/apache/airflow/pull/4396.
This led to a long discussion, but the PR has been quiet for the last few days. It would be a
shame if the effort put in by @ffinfo led nowhere. To recap the summary at the end of the PR,
the current status is:

Internal changes:


  *   This PR persists task dependencies in a new table `dag_edge`.
  *   The term "graph" is introduced in the code. A graph contains the structure of a DAG,
i.e. the "edges" (dependencies) and "nodes" (tasks).
  *   A DagRun is bound to one `graph_id`.
  *   Currently, Airflow displays only the latest version of a DAG in the UI (both graph
& tree view). This means that if you delete a task, you can no longer see past runs of that
task.
  *   In the graph view you can now see different graph versions, because we store both tasks
and edges.
  *   For the record: in the tree view you still only see the latest version, because it is
not possible to combine all history into a single view.
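
To make the internal changes above concrete, here is a minimal in-memory sketch of the idea,
not the code from the PR. The class names `Graph` and `GraphStore`, the field names, and the
`get_or_create` helper are assumptions made for illustration; only `graph_id`, the notion of
edges and nodes, and the DagRun-to-graph binding come from the summary above.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, Tuple

# An edge is (upstream_task_id, downstream_task_id); comparable to a row in `dag_edge`.
Edge = Tuple[str, str]


@dataclass(frozen=True)
class Graph:
    """One version of a DAG's structure: its nodes (tasks) and edges (dependencies)."""
    graph_id: int
    dag_id: str
    nodes: FrozenSet[str]
    edges: FrozenSet[Edge]


@dataclass
class DagRun:
    """A run is bound to exactly one graph version via graph_id."""
    dag_id: str
    execution_date: str
    graph_id: int


class GraphStore:
    """Hypothetical stand-in for the metastore tables holding graphs and edges."""

    def __init__(self) -> None:
        self._graphs: Dict[int, Graph] = {}
        self._next_id = 1

    def get_or_create(self, dag_id: str, nodes: Iterable[str], edges: Iterable[Edge]) -> Graph:
        """Reuse an existing graph if the structure is unchanged, else store a new version."""
        node_set, edge_set = frozenset(nodes), frozenset(edges)
        for graph in self._graphs.values():
            if graph.dag_id == dag_id and graph.nodes == node_set and graph.edges == edge_set:
                return graph
        graph = Graph(self._next_id, dag_id, node_set, edge_set)
        self._graphs[graph.graph_id] = graph
        self._next_id += 1
        return graph

    def get(self, graph_id: int) -> Graph:
        return self._graphs[graph_id]


store = GraphStore()

# First version of the DAG: a -> b -> c
v1 = store.get_or_create("my_dag", {"a", "b", "c"}, {("a", "b"), ("b", "c")})
run1 = DagRun("my_dag", "2019-02-22", v1.graph_id)

# Task "c" is deleted; the next run is bound to a new graph version.
v2 = store.get_or_create("my_dag", {"a", "b"}, {("a", "b")})
run2 = DagRun("my_dag", "2019-02-23", v2.graph_id)

# The old run still resolves to the structure it actually ran with.
print(sorted(store.get(run1.graph_id).nodes))
print(sorted(store.get(run2.graph_id).nodes))
```

Because each run keeps its own `graph_id`, the graph view can render the structure a run was
executed with, even after a task has been removed from the DAG file.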

Changes from a user perspective:


  *   Nothing in the tree view.
  *   In the graph view, you can now view different "graphs" if you change the structure of
your DAG. Note the graph view shows DAG runs. If you change your DAG without running it, it
does not show in the graph view.
  *   When you have no DAG runs, there is no graph to show. In that case, as @ffinfo
described in the PR, the graph is read from the DAG file instead. You can see this behaviour
in the graph view URL:
     *   if DagRuns exist: http://host/graph?dag_id=my_dag
     *   if no DagRuns exist: http://host/graph?dag_id=my_dag&read_from_file=True
  *   The screenshots in https://github.com/apache/airflow/pull/4396#issuecomment-465217731
show this case. Since this is more an internal detail of how Airflow works, and not really
informative for the user, @ffinfo removed the message in his last commit.
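
The fallback described in the list above can be sketched as a small URL builder. This is an
illustration only: the function `graph_view_url` and its parameters are hypothetical and not
taken from the PR; the two resulting URLs are the ones listed above.

```python
from urllib.parse import urlencode


def graph_view_url(base_url: str, dag_id: str, has_dag_runs: bool) -> str:
    """Build the graph view URL (hypothetical helper, not part of Airflow).

    With DagRuns, the graph comes from the persisted graph of a run.
    Without DagRuns, nothing is persisted yet, so the view reads the DAG file.
    """
    params = {"dag_id": dag_id}
    if not has_dag_runs:
        params["read_from_file"] = True
    return f"{base_url}/graph?{urlencode(params)}"


print(graph_view_url("http://host", "my_dag", has_dag_runs=True))
print(graph_view_url("http://host", "my_dag", has_dag_runs=False))
```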

Judging by the PR comments, everybody likes the idea of persisting more of the DAG in the
DB. All issues mentioned were addressed. It would be great to see this work merged in Airflow,
so please discuss anything about the PR/AIP here.

Cheers,
Bas