From: Ash Berlin-Taylor
To: dev@airflow.apache.org
Subject: Re: [DISCUSS] AIP-12 Persist DAG into DB
Date: Fri, 8 Mar 2019 13:52:58 +0000

Comments inline.

> On 8 Mar 2019, at 11:28, Kevin Yang wrote:
>
> Hi all,
> When I was preparing some work related to this AIP I found something very concerning.
> I noticed this JIRA ticket is trying to remove the dependency of dagbag from webserver, which is awesome; we badly wanted this but never got around to starting work on it. However when I looked at some subtasks of it, which try to remove the dagbag dependency from each endpoint, I found the way we remove the dependency of dagbag is not ideal. For example this PR will require us to parse the dag file each time we hit the endpoint.

The counter argument: this PR removes the need for the confusing "Refresh" button from the UI, and in general you only pay the cost for the expensive DAGs when you ask about them. (I don't know what/when we call the /pickle_info endpoint off the top of my head.)

This endpoint may be one to hold off on (as it can ask for multiple dags) but there are some that definitely don't need a full dag bag or to even parse the dag file; the current DAG model has enough info.

>
> If we go down this path, we indeed can get rid of the dagbag dependency easily, but we will have to 1. increase the DB load (not too concerning at the moment), 2. wait for the DAG file to be parsed before getting the page back, potentially multiple times. DAG files can sometimes take quite a while to parse, e.g. we have some framework DAG files generating large numbers of DAGs from some static config files or even jupyter notebooks and they can take 30+ seconds to parse. Yes we don't like large DAG files but people do see the beauty of code as config and sometimes heavily abuse/leverage it. Even assuming all users have the same nice small python file that can be parsed fast, I'm still a bit worried about this approach. Continuing on this path means we've chosen DagModel to be the serialized representation of DAG and DB columns to hold different properties--it can be one candidate but I don't know if we should settle on that now. I would personally prefer a more compact, e.g.
> JSON5, and easy to scale representation (such that serializing new fields != DB upgrade).

Do you mean https://json5.org/ or is this a typo? That might be okay for a nicer user front end, but the "canonical" version stored in the DB should be something "plainer" like just JSON.

I'm not sure that "serializing new fields != DB upgrade" is that big of a concern, as we don't add fields that often. One possible way of dealing with it if we do is to have a hybrid approach: a few distinct columns, but then a JSON blob. (And if we were only to support Postgres we could just use JSONB. But I think our friends at Google may object ;) )

Adding a new column in a DB migration with a default NULL shouldn't be an expensive operation, or difficult to achieve.

>
> In my imagination we would have to collect the list of dynamic features depending on unserializable fields of a DAG and start a discussion/vote on dropping support of them (I'm working on this but if anyone has already done so please take over), decide on the serialized representation of a DAG and then replace dagbag with it in webserver. Per previous discussion and some offline discussions with Dan, one future of DAG serialization that I like would look similar to this:
>
> https://imgur.com/ncqqQgc

Something I've thought about before for other things was to embed an API server _into_ the scheduler - this would be useful for k8s healthchecks, native Prometheus metrics without needing a statsd bridge, and could have endpoints to get information such as this directly.

I was thinking it would be _in_ the scheduler process using either threads (ick. Python's still got a GIL doesn't it?) or using async/twisted etc. (not a side-car process like we have with the logs webserver for `airflow worker`).

(This is possibly an unrelated discussion, but might be worth talking about?)
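
Going back to the hybrid storage idea a few paragraphs up (a few distinct columns plus a JSON blob), here is a minimal sketch of what that could look like. It uses SQLite from the Python standard library purely so that it runs anywhere. The table name `serialized_dag`, its columns, and the `save_dag`/`load_dag` helpers are all hypothetical, made up for illustration, and are not Airflow's schema or API.

```python
import json
import sqlite3

# Hypothetical table and column names, for illustration only;
# this is not Airflow's actual schema.
conn = sqlite3.connect(":memory:")

# dag_id and fileloc are distinct columns (things we filter on often);
# everything else lives in the "data" JSON blob.
conn.execute(
    """
    CREATE TABLE serialized_dag (
        dag_id  TEXT PRIMARY KEY,
        fileloc TEXT NOT NULL,
        data    TEXT NOT NULL
    )
    """
)


def save_dag(dag_id, fileloc, **extra):
    """Store the frequently queried fields as columns, the rest as JSON."""
    conn.execute(
        "INSERT OR REPLACE INTO serialized_dag (dag_id, fileloc, data) "
        "VALUES (?, ?, ?)",
        (dag_id, fileloc, json.dumps(extra, sort_keys=True)),
    )


def load_dag(dag_id):
    """Return the stored DAG description as a plain dict, or None."""
    row = conn.execute(
        "SELECT dag_id, fileloc, data FROM serialized_dag WHERE dag_id = ?",
        (dag_id,),
    ).fetchone()
    if row is None:
        return None
    result = json.loads(row[2])
    result.update(dag_id=row[0], fileloc=row[1])
    return result


# A new serialized field ("edges") needs no schema change: it goes in the blob.
save_dag("example", "/dags/example.py", edges=[["a", "b"], ["b", "c"]])

# Promoting a field to a real column later is a nullable ADD COLUMN.
conn.execute("ALTER TABLE serialized_dag ADD COLUMN schedule TEXT DEFAULT NULL")
```

On Postgres the `data` column could be JSONB instead of text; a plain text column holding JSON is the lowest common denominator across the databases we support.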
> We can still discuss/vote on which approach we want to take but I don't want the door to the above design to be shut right now, or we will have to spend a lot of effort switching paths later.
>
> Bas and Peter, I'm very sorry to extend the discussion but I do think this is tightly related to the AIP and PRs behind it. And my sincere apology for bringing this up so late (I only pull the open PR list occasionally; if there's a way to subscribe to new PR events I'd love to know how).

It's noisy, but you can subscribe to commits@airflow.apache.org (but be warned, this also includes all Jira tickets, edits of every comment on github etc.).

>
> Cheers,
> Kevin Y
>
> On Thu, Feb 28, 2019 at 1:36 PM Peter van t Hof wrote:
> Hi all,
>
> Just some comments on the points Bolke gave in relation to my PR.
>
> First of all, the main focus is: making the webserver stateless.
>
> > 1) Make the webserver stateless: needs the graph of the *current* dag
>
> This is the main goal, but for this a lot more PRs will be coming once my current one is merged. For edges and graph view this is covered in my PR already.
>
> > 2) Version dags: for consistency mainly and not requiring parsing of the
> > dag on every loop
>
> In my PR the historical graphs will be stored for each DagRun. This means that you can see whether an older DagRun had the same graph structure, even if some tasks no longer exist in the current graph. Especially for dynamic DAGs this is very useful.
>
> > 3) Make the scheduler not require DAG files. This could be done if the
> > edges contain all information when to trigger the next task. We can then
> > have event driven dag parsing outside of the scheduler loop, ie. by the
> > cli. Storage can also be somewhere else (git, artifactory, filesystem,
> > whatever).
>
> The scheduler is almost untouched in this PR.
> The only thing that is added is that these edges are saved to the database, but the scheduling itself didn't change. The scheduler still depends on the DAG object.
>
> > 4) Fully serialise the dag so it becomes transferable to workers
>
> It's nice to see that people have a lot of ideas about this. But as Fokko already mentioned this is out of scope for the issue we are trying to solve. I also have some ideas about this but I'd like to limit this PR/AIP to the webserver.
>
> For now my PR does solve 1 and 2 and the rest of the behaviour (like scheduling) is untouched.
>
> Gr,
> Peter