airflow-dev mailing list archives

From "Driesprong, Fokko" <fo...@driesprong.frl>
Subject Re: Airflow DAG Serialisation
Date Sat, 27 Jul 2019 07:54:18 GMT
Looks great Zhou,

I have one thing that pops into my mind while reading the AIP: should we
keep the caching at the webserver level? As the famous quote goes: *"There
are only two hard things in Computer Science: cache invalidation and naming
things." -- Phil Karlton*

Right now, the fundamental change that is being proposed in the AIP is
fetching the DAGs from the database in a serialized format, and not parsing
the Python files all the time. This will already give a great performance
improvement on the webserver side because it removes a lot of the
processing. However, since we're still fetching the DAGs from the database
at a regular interval and caching them in the local process, we still have
the two issues that Airflow is suffering from right now:

   1. No snappy UI because it is still polling the database at a regular
   interval.
   2. Inconsistency between webservers because they might poll at
   different intervals. I think we've all seen this:
   https://www.youtube.com/watch?v=sNrBruPS3r4
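
The second issue can be illustrated with a small simulation. This is only a sketch: `FakeDb` and `PollingWebserver` are hypothetical stand-ins, not Airflow classes, and the intervals are made-up numbers.

```python
class FakeDb:
    """Stand-in for the metadata database; holds one DAG version number."""

    def __init__(self):
        self.dag_version = 1


class PollingWebserver:
    """Hypothetical webserver that re-reads the database every `interval` seconds."""

    def __init__(self, db, interval):
        self.db = db
        self.interval = interval
        self.cached_version = db.dag_version
        self.last_poll = 0

    def serve(self, now):
        # Refresh the local cache only when the polling interval has elapsed.
        if now - self.last_poll >= self.interval:
            self.cached_version = self.db.dag_version
            self.last_poll = now
        return self.cached_version


db = FakeDb()
fast = PollingWebserver(db, interval=10)
slow = PollingWebserver(db, interval=60)

db.dag_version = 2  # a new DAG version is written shortly after startup

# At t=15 only the fast poller has refreshed, so the two servers disagree.
views_at_t15 = (fast.serve(now=15), slow.serve(now=15))
# At t=60 the slow poller has caught up as well.
views_at_t60 = (fast.serve(now=60), slow.serve(now=60))
```

Between the two polls, a user who is load-balanced across both webservers would see the DAG flip between the old and the new version.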

As I also mentioned in the Slack channel, I strongly feel that we should be
able to render most views from the tables in the database, so without
touching the blob. For specific views, we could just pull the blob from the
database. In this case we always have the latest version, and we tackle the
second point above.
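
A minimal sketch of that split, using an in-memory SQLite database. The table `dag_demo` and its columns are hypothetical names chosen for illustration and are not Airflow's actual schema.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: plain columns for the common views, plus the JSON blob.
conn.execute(
    "CREATE TABLE dag_demo ("
    " dag_id TEXT PRIMARY KEY, is_paused INTEGER, schedule TEXT, data TEXT)"
)
blob = json.dumps({"tasks": [{"task_id": "extract"}, {"task_id": "load"}]})
conn.execute(
    "INSERT INTO dag_demo VALUES (?, ?, ?, ?)",
    ("example_dag", 0, "@daily", blob),
)


def list_dags(conn):
    """List view: reads only the plain columns and never touches the blob."""
    query = "SELECT dag_id, is_paused, schedule FROM dag_demo ORDER BY dag_id"
    return conn.execute(query).fetchall()


def load_dag(conn, dag_id):
    """Detail view: pulls and decodes the blob for a single DAG on demand."""
    row = conn.execute(
        "SELECT data FROM dag_demo WHERE dag_id = ?", (dag_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None
```

Because both functions query the database on every request, there is no local cache to go stale; the cost is one query per page view.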

To tackle the first one, I also have an idea. We should change the DAG
parser from a loop to something that uses inotify
https://pypi.org/project/inotify_simple/. This will change it from polling
to an event-driven design, which is much more performant and less resource
hungry. But this would be an AIP on its own.
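
Roughly, an event-driven watcher could look like the sketch below. It assumes Linux and the inotify_simple package linked above; `is_dag_file`, `watch_dag_folder` and the `on_change` callback are hypothetical names, and only the top-level folder is watched (inotify watches are not recursive).

```python
import os


def is_dag_file(name):
    """Hypothetical filter: treat any non-hidden .py file as a potential DAG file."""
    return name.endswith(".py") and not name.startswith(".")


def watch_dag_folder(dag_folder, on_change):
    """Block on inotify events instead of rescanning the folder in a loop."""
    # Imported lazily so this module still loads where inotify_simple is missing.
    from inotify_simple import INotify, flags

    inotify = INotify()
    inotify.add_watch(
        dag_folder, flags.CLOSE_WRITE | flags.MOVED_TO | flags.DELETE
    )
    while True:
        # read() blocks until the kernel reports at least one event.
        for event in inotify.read():
            if is_dag_file(event.name):
                on_change(os.path.join(dag_folder, event.name))
```

Calling `watch_dag_folder("/path/to/dags", print)` would print the path of every DAG file that is written, moved in, or deleted, with no work done while the folder is idle.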

Again, great design and a comprehensive AIP, but I would include the
caching on the webserver to greatly improve the user experience in the UI.
Looking forward to the opinion of others on this.

Cheers, Fokko

On Sat, 27 Jul 2019 at 01:44, Zhou Fang <zhoufang@google.com.invalid> wrote:

> Hi Kaxil,
>
> Just sent out the AIP:
>
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-24+DAG+Persistence+in+DB+using+JSON+for+Airflow+Webserver+and+%28optional%29+Scheduler
>
> Thanks!
> Zhou
>
>
> On Fri, Jul 26, 2019 at 1:33 PM Zhou Fang <zhoufang@google.com> wrote:
>
> > Hi Kaxil,
> >
> > We are also working on persisting DAGs into DB using JSON for Airflow
> > webserver in Google Composer. We aim to minimize the changes to the
> > current Airflow code. Happy to get synced on this!
> >
> > Here is our progress:
> > (1) Serializing DAGs using Pickle to be used in webserver
> > It has been launched in Composer. I am working on the PR to upstream it:
> > https://github.com/apache/airflow/pull/5594
> > Currently it does not support non-Airflow operators and we are working on
> > a fix.
> >
> > (2) Caching Pickled DAGs in DB to be used by webserver
> > We have a proof-of-concept implementation, working on an AIP now.
> >
> > (3) Using JSON instead of Pickle in (1) and (2)
> > Decided to use JSON because Pickle is neither secure nor human readable.
> > The serialization approach is very similar to (1).
> >
> > I will update the PR (https://github.com/apache/airflow/pull/5594) to
> > replace Pickle with JSON, and send our design of (2) as an AIP next week.
> > Glad to check together whether our implementation makes sense and make
> > improvements to it.
> >
> > Thanks!
> > Zhou
> >
> >
> > On Fri, Jul 26, 2019 at 7:37 AM Kaxil Naik <kaxilnaik@gmail.com> wrote:
> >
> >> Hi all,
> >>
> >> We, at Astronomer, are going to spend time working on DAG Serialisation.
> >> There are 2 AIPs that are somewhat related to what we plan to work on:
> >>
> >>    - AIP-18 Persist all information from DAG file in DB
> >>    <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-18+Persist+all+information+from+DAG+file+in+DB>
> >>    - AIP-19 Making the webserver stateless
> >>    <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-19+Making+the+webserver+stateless>
> >>
> >> We plan to use JSON as the Serialisation format and store it as a blob in
> >> metadata DB.
> >>
> >> *Goals:*
> >>
> >>    - Make Webserver Stateless
> >>    - Use the same version of the DAG across Webserver & Scheduler
> >>    - Keep backward compatibility and have a flag (globally & at DAG level)
> >>    to turn this feature on/off
> >>    - Enable DAG Versioning (extended Goal)
> >>
> >>
> >> We will be preparing a proposal (AIP) after some research and some initial
> >> work and open it for the suggestions of the community.
> >>
> >> We already had some good brain-storming sessions with Twitter folks (DanD &
> >> Sumit), folks from GoDataDriven (Fokko & Bas) & Alex (from Uber) which will
> >> be a good starting point for us.
> >>
> >> If anyone in the community is interested in it or has some experience about
> >> the same and want to collaborate please let me know and join
> >> #dag-serialisation channel on Airflow Slack.
> >>
> >> Regards,
> >> Kaxil
> >>
> >
>
