From: Kevin Yang
Date: Mon, 29 Jul 2019 17:42:10 -0700
Subject: Re: Airflow DAG Serialisation
To: Zhou Fang
Cc: dev@airflow.apache.org

Hi Zhou,

Totally understood, thank you for that. Streaming logic does cover most
cases, though we still have worst cases where os.walk doesn't give us
consistent results across file and file/dir additions/renames, causing a
different result order from list_py_file_paths (e.g. right after we parse
the first dir, it is renamed and will be parsed last in the second DAG
load, or we merge a new file right after the file paths are collected).
Maybe there's a way to guarantee the order of parsing, but I'm not sure
it's worth the effort, given that it is less of a problem when the
end-to-end parsing time is small enough. I understand it may have started
as a short-term improvement, but since it should not be too much more
complicated, we'd rather start with the unified long-term pattern.

Cheers,
Kevin Y

On Mon, Jul 29, 2019 at 3:59 PM Zhou Fang wrote:

> Hi Kevin,
>
> Yes, DAG persistence in the DB is definitely the way to go. I referred to
> the async DAG loader because it may alleviate your current problem (since
> the code is ready).
>
> It actually reduces the time to 15 min, because DAGs are refreshed by the
> background process in a streaming way and you don't need to restart the
> webserver every 20 min.
>
> Thanks,
> Zhou
>
> On Mon, Jul 29, 2019 at 3:14 PM Kevin Yang wrote:
>
>> Hi Zhou,
>>
>> Thank you for the pointer. This solves the issue where the gunicorn
>> restart rate throttles the webserver refresh rate, but not the long DAG
>> parsing time issue, right? Worst case we still wait 30 mins for a change
>> to show up, compared to the previous 35 mins (I was wrong on the number:
>> it should be 35 mins instead of 55 mins, as the clock starts whenever
>> the webserver restarts).
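Kevin's point above about os.walk not guaranteeing a consistent parse order could be addressed by sorting. A minimal sketch: the function name mirrors Airflow's list_py_file_paths, but this is an illustrative reimplementation, not the real one.

```python
import os

def list_py_file_paths(root):
    """Collect .py files under root in a deterministic order.

    os.walk yields directories in arbitrary order, so sorting both the
    directory list (in place, which steers the walk) and the filenames
    makes two scans of an unchanged tree always agree.
    """
    paths = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # forces os.walk to descend in sorted order
        for name in sorted(filenames):
            if name.endswith(".py"):
                paths.append(os.path.join(dirpath, name))
    return paths
```

This removes the "same tree, different order" class of inconsistency, though (as noted above) files renamed or added mid-scan can still produce differing snapshots.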
>> I believe in the previous discussion we first proposed this local
>> webserver DAG parsing optimization to use the same DAG parsing logic as
>> the scheduler to speed up the parsing. Then the stateless webserver
>> proposal came up, and we were bought in that it is a better idea to
>> persist DAGs into the DB and read directly from the DB: better DAG
>> definition consistency and webserver cluster consistency. I'm all
>> supportive of the proposed structure in AIP-24, but -1 on just feeding
>> the webserver from a single subprocess parsing the DAGs. I would imagine
>> there won't be too much additional work to fetch from the DB instead of
>> a subprocess, would there? (I haven't looked into the serialization
>> format part, but I assume they are the same/similar.)
>>
>> Cheers,
>> Kevin Y
>>
>> On Mon, Jul 29, 2019 at 2:18 PM Zhou Fang wrote:
>>
>>> Hi Kevin,
>>>
>>> The problem that DAG parsing takes a long time can be solved by
>>> asynchronous DAG loading: https://github.com/apache/airflow/pull/5594
>>>
>>> The idea is that a background process parses DAG files and sends DAGs
>>> to the webserver process every [webserver] dagbag_sync_interval = 10s.
>>>
>>> We have launched it in Composer, so our users can set the webserver
>>> worker restart interval to 1 hour (or longer). The background DAG
>>> parsing process refreshes all DAGs every [webserver]
>>> collect_dags_interval = 30s.
>>>
>>> If parsing all DAGs takes 15 min, you can see DAGs being gradually
>>> refreshed with this feature.
>>>
>>> Thanks,
>>> Zhou
>>>
>>> On Sat, Jul 27, 2019 at 2:43 AM Kevin Yang wrote:
>>>
>>>> Nice job Zhou!
>>>>
>>>> Really excited, this is exactly what we wanted for the webserver
>>>> scaling issue. I want to add another big driver for Airbnb to start
>>>> thinking about this previously, to support the effort: it can not only
>>>> bring consistency between webservers but also bring consistency
>>>> between the webserver and the scheduler/workers.
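The async loading Zhou describes (a background parser periodically refreshing DAGs while the webserver keeps serving) can be sketched as a thread that atomically swaps in a freshly collected DAG dict. The class and argument names below are illustrative, not Airflow's actual implementation:

```python
import threading

class AsyncDagLoader:
    """Sketch of async DAG loading: a background thread re-collects DAGs
    every collect_dags_interval seconds and swaps the whole dict in via a
    single attribute assignment, so web requests never block on parsing
    and always see a consistent snapshot."""

    def __init__(self, collect_fn, collect_dags_interval=30):
        self._collect_fn = collect_fn          # callable returning {dag_id: dag}
        self._interval = collect_dags_interval
        self._dags = {}
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._loop, daemon=True)

    def start(self):
        self._dags = self._collect_fn()        # initial synchronous fill
        self._thread.start()

    def _loop(self):
        # Event.wait returns False on timeout (keep looping) and True
        # once stop() has been called.
        while not self._stop.wait(self._interval):
            self._dags = self._collect_fn()    # atomic reference swap

    @property
    def dags(self):
        return self._dags

    def stop(self):
        self._stop.set()
```

Readers that grab `loader.dags` keep working against one snapshot even while the next collection is in flight, which is the streaming behavior described above.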
>>>> It may be less of a problem if the total DAG parsing time is small,
>>>> but for us the total DAG parsing time is 15+ mins and we had to set
>>>> the webserver (gunicorn subprocess) restart interval to 20 mins, which
>>>> leads to a worst-case 15+20+15=50 min delay between when the scheduler
>>>> starts to schedule things and when users can see their deployed
>>>> DAGs/changes...
>>>>
>>>> I'm not so sure about the scheduler performance improvement: currently
>>>> we already feed the main scheduler process with SimpleDag through a
>>>> DagFileProcessorManager running in a subprocess--in the future we feed
>>>> it with data from the DB, which is likely slower (though the diff
>>>> should have negligible impact on scheduler performance). In fact, if
>>>> we keep the existing behavior and try to schedule only freshly parsed
>>>> DAGs, then we may need to deal with a consistency issue--the DAG
>>>> processor and the scheduler race to update the flag indicating whether
>>>> the DAG is newly parsed. No big deal, just some thoughts off the top
>>>> of my head that hopefully can be helpful.
>>>>
>>>> And good idea on pre-rendering the template; I believe template
>>>> rendering was the biggest concern in the previous discussion. We've
>>>> also chosen the pre-rendering + JSON approach in our smart sensor API
>>>> <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-17+Airflow+sensor+optimization>
>>>> and it seems to be working fine--a supporting case for your proposal
>>>> ;) There's a WIP PR for it just in case you are interested--maybe we
>>>> can even share some logic.
>>>>
>>>> Thumbs-up again for this, and please don't hesitate to reach out if
>>>> you want to discuss further with us or need any help from us.
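The pre-rendering + JSON idea mentioned above can be illustrated in miniature: render templated fields eagerly at serialization time, then persist plain JSON, so the webserver never needs the original Python objects. The string replace here stands in for real Jinja rendering, and all names are hypothetical:

```python
import json

def serialize_task(task_id, bash_command, context):
    """Toy illustration of pre-rendering + JSON: resolve the template
    against the context first, then store a self-contained JSON record
    that can be displayed without re-importing any DAG code."""
    # Stand-in for Jinja: substitute the one macro this sketch knows about.
    rendered = bash_command.replace("{{ ds }}", context["ds"])
    return json.dumps({"task_id": task_id, "bash_command": rendered})
```

Because the stored blob already contains the rendered value, two webservers reading it will always show the same thing, which is exactly the consistency argument made in this thread.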
>>>>
>>>> Cheers,
>>>> Kevin Y
>>>>
>>>> On Sat, Jul 27, 2019 at 12:54 AM Driesprong, Fokko wrote:
>>>>
>>>> > Looks great Zhou,
>>>> >
>>>> > I have one thing that popped into my mind while reading the AIP:
>>>> > should we keep the caching at the webserver level? As the famous
>>>> > quote goes: *"There are only two hard things in Computer Science:
>>>> > cache invalidation and naming things." -- Phil Karlton*
>>>> >
>>>> > Right now, the fundamental change being proposed in the AIP is
>>>> > fetching the DAGs from the database in a serialized format, rather
>>>> > than parsing the Python files all the time. This will already give
>>>> > a great performance improvement on the webserver side because it
>>>> > removes a lot of the processing. However, since we're still
>>>> > fetching the DAGs from the database at a regular interval and
>>>> > caching them in the local process, we still have the two issues
>>>> > that Airflow is suffering from right now:
>>>> >
>>>> > 1. No snappy UI, because it is still polling the database at a
>>>> > regular interval.
>>>> > 2. Inconsistency between webservers, because they might poll at
>>>> > different intervals; I think we've all seen this:
>>>> > https://www.youtube.com/watch?v=sNrBruPS3r4
>>>> >
>>>> > As I also mentioned in the Slack channel, I strongly feel that we
>>>> > should be able to render most views from the tables in the
>>>> > database, so without touching the blob. For specific views, we
>>>> > could just pull the blob from the database. In this case we always
>>>> > have the latest version, and we tackle the second point above.
>>>> >
>>>> > To tackle the first one, I also have an idea. We should change the
>>>> > DAG parser from a loop to something that uses inotify
>>>> > https://pypi.org/project/inotify_simple/. This will change it from
>>>> > polling to an event-driven design, which is much more performant
>>>> > and less resource hungry.
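Fokko's inotify suggestion could look roughly like this. The sketch assumes the inotify_simple package from the link above (Linux-only); the event source is injectable so the filtering logic works without the package, and all names are illustrative:

```python
def watch_dag_folder(dag_folder, on_change, read_events=None):
    """Event-driven alternative to the polling loop: block until the
    kernel reports a change in dag_folder, then re-parse only the DAG
    files that actually changed, instead of rescanning on a timer."""
    if read_events is None:
        # Lazy import: inotify_simple is a third-party, Linux-only package.
        from inotify_simple import INotify, flags
        ino = INotify()
        ino.add_watch(dag_folder, flags.CREATE | flags.MODIFY | flags.MOVED_TO)
        read_events = lambda: ino.read()  # blocks until something changes
    for event in read_events():
        if event.name.endswith(".py"):
            on_change(event.name)         # re-parse just this DAG file
```

As noted in the thread, productionizing this (recursive watches, rename handling, non-Linux fallbacks) is big enough to be its own AIP.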
>>>> > But this would be an AIP on its own.
>>>> >
>>>> > Again, great design and a comprehensive AIP, but I would include
>>>> > the caching on the webserver to greatly improve the user experience
>>>> > in the UI. Looking forward to the opinions of others on this.
>>>> >
>>>> > Cheers, Fokko
>>>> >
>>>> > On Sat, Jul 27, 2019 at 01:44, Zhou Fang wrote:
>>>> >
>>>> > > Hi Kaxil,
>>>> > >
>>>> > > Just sent out the AIP:
>>>> > > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-24+DAG+Persistence+in+DB+using+JSON+for+Airflow+Webserver+and+%28optional%29+Scheduler
>>>> > >
>>>> > > Thanks!
>>>> > > Zhou
>>>> > >
>>>> > > On Fri, Jul 26, 2019 at 1:33 PM Zhou Fang wrote:
>>>> > >
>>>> > > > Hi Kaxil,
>>>> > > >
>>>> > > > We are also working on persisting DAGs into the DB using JSON
>>>> > > > for the Airflow webserver in Google Composer. We aim to
>>>> > > > minimize the change to the current Airflow code. Happy to get
>>>> > > > synced on this!
>>>> > > >
>>>> > > > Here is our progress:
>>>> > > > (1) Serializing DAGs using Pickle to be used in the webserver.
>>>> > > > It has been launched in Composer, and I am working on the PR to
>>>> > > > upstream it: https://github.com/apache/airflow/pull/5594
>>>> > > > Currently it does not support non-Airflow operators, and we are
>>>> > > > working on a fix.
>>>> > > >
>>>> > > > (2) Caching pickled DAGs in the DB to be used by the webserver.
>>>> > > > We have a proof-of-concept implementation and are working on an
>>>> > > > AIP now.
>>>> > > >
>>>> > > > (3) Using JSON instead of Pickle in (1) and (2).
>>>> > > > We decided to use JSON because Pickle is neither secure nor
>>>> > > > human-readable. The serialization approach is very similar to
>>>> > > > (1).
>>>> > > >
>>>> > > > I will update the PR
>>>> > > > (https://github.com/apache/airflow/pull/5594) to replace Pickle
>>>> > > > with JSON, and send our design of (2) as an AIP next week.
>>>> > > > Glad to check together whether our implementation makes sense
>>>> > > > and make improvements on that.
>>>> > > >
>>>> > > > Thanks!
>>>> > > > Zhou
>>>> > > >
>>>> > > > On Fri, Jul 26, 2019 at 7:37 AM Kaxil Naik wrote:
>>>> > > >
>>>> > > >> Hi all,
>>>> > > >>
>>>> > > >> We, at Astronomer, are going to spend time working on DAG
>>>> > > >> Serialisation. There are 2 AIPs that are somewhat related to
>>>> > > >> what we plan to work on:
>>>> > > >>
>>>> > > >> - AIP-18 Persist all information from DAG file in DB
>>>> > > >> <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-18+Persist+all+information+from+DAG+file+in+DB>
>>>> > > >> - AIP-19 Making the webserver stateless
>>>> > > >> <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-19+Making+the+webserver+stateless>
>>>> > > >>
>>>> > > >> We plan to use JSON as the serialisation format and store it
>>>> > > >> as a blob in the metadata DB.
>>>> > > >>
>>>> > > >> *Goals:*
>>>> > > >>
>>>> > > >> - Make the webserver stateless
>>>> > > >> - Use the same version of the DAG across webserver & scheduler
>>>> > > >> - Keep backward compatibility and have a flag (globally & at
>>>> > > >> DAG level) to turn this feature on/off
>>>> > > >> - Enable DAG versioning (extended goal)
>>>> > > >>
>>>> > > >> We will be preparing a proposal (AIP) after some research and
>>>> > > >> some initial work, and will open it for the suggestions of the
>>>> > > >> community.
>>>> > > >>
>>>> > > >> We already had some good brainstorming sessions with Twitter
>>>> > > >> folks (DanD & Sumit), folks from GoDataDriven (Fokko & Bas) &
>>>> > > >> Alex (from Uber), which will be a good starting point for us.
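The "JSON blob in the metadata DB" plan Kaxil outlines can be sketched with a plain SQL table; the table and column names here are hypothetical, not the actual AIP-24 schema:

```python
import json
import sqlite3

# Sketch of persisting serialized DAGs: one row per dag_id, with the
# whole serialized DAG stored as a JSON blob that any webserver can read.
def write_serialized_dag(conn, dag_id, dag_dict):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS serialized_dag "
        "(dag_id TEXT PRIMARY KEY, data TEXT)"
    )
    # Upsert so re-parsing a DAG simply replaces its previous blob.
    conn.execute(
        "INSERT OR REPLACE INTO serialized_dag VALUES (?, ?)",
        (dag_id, json.dumps(dag_dict)),
    )

def read_serialized_dag(conn, dag_id):
    row = conn.execute(
        "SELECT data FROM serialized_dag WHERE dag_id = ?", (dag_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None
```

Because every webserver reads the same row, this gives the single source of truth the thread keeps coming back to: no per-process DagBag that can drift between gunicorn workers.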
>>>> > > >>
>>>> > > >> If anyone in the community is interested in it or has some
>>>> > > >> experience with the same and wants to collaborate, please let
>>>> > > >> me know and join the #dag-serialisation channel on Airflow
>>>> > > >> Slack.
>>>> > > >>
>>>> > > >> Regards,
>>>> > > >> Kaxil