From: Maxime Beauchemin
Date: Wed, 27 Feb 2019 11:24:20 -0800
Subject: Re: [DISCUSS] AIP-12 Persist DAG into DB
To: dev@airflow.apache.org

Oh, a few more things I wanted to bring up.

While playing with pex (years ago!), I added a feature that opens up a
full-featured, DAG-centric CLI from a DAG file. That feature could become
the interface of a containerized DAG approach. As far as I know the feature
is not well documented, but it is important for containerization, so let me
explain it here. It goes something like this:

    dag = DAG(...)
    # ... dag is defined here ...

    if __name__ == "__main__":
        dag.cli()

Here's an example:
https://github.com/apache/airflow/blob/master/airflow/example_dags/example_bash_operator.py#L72

Now you can do things like `python mydagmodule.py list_tasks` or
`python mydagmodule.py run my_task_id 2019-01-01`. All the DAG-centric
Airflow CLI commands are made available here, and you don't have to pass
the DAG argument to them in this context. In fact it uses the same
calls/code as the main CLI.

This means that you can bake a Docker image around that DAG module and set
that module as your container's entry point. You can imagine that the worker
then just needs to fetch the image from the Docker registry and `docker run`
the right run command. That's fully containerized task execution. I think it
would take something like 10 lines of code to change Airflow to operate this
way (without taking into account change management, supporting different
Docker registries, the web UI, ...).
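A minimal, self-contained sketch of such a module (assuming Airflow
1.10-era import paths; `mydagmodule.py` is just the illustrative name from
the CLI examples above, and the dag_id, task, and command are placeholders):

    # mydagmodule.py -- illustrative module name, not part of the Airflow codebase.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    dag = DAG(
        dag_id="my_dag",
        start_date=datetime(2019, 1, 1),
        schedule_interval="@daily",
    )

    my_task = BashOperator(
        task_id="my_task_id",
        bash_command="echo 'running my_task_id'",
        dag=dag,
    )

    if __name__ == "__main__":
        # Hands sys.argv to the regular DAG-centric CLI, so commands like
        # `python mydagmodule.py list_tasks` or
        # `python mydagmodule.py run my_task_id 2019-01-01`
        # work without passing a DAG id.
        dag.cli()

With a module like this as the container entry point, a worker only needs
the image name plus the same CLI arguments to run a task.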
Now, about the problem of losing control over the core that I mentioned in
my previous email (where the core is baked into the image and we lose
centralized control over that logic): I think we need some sort of
lightweight Airflow SDK that works over the REST API. The DAGs, instead of
importing the whole Airflow Python package, would only import that SDK, and
the server-side implementation of the calls can be evolved at a faster pace
than the SDK.

Max

On Wed, Feb 27, 2019 at 10:53 AM James Meickle wrote:

> On the topic of using Docker, I highly recommend looking at Argo Workflows
> and some of their sample code: https://github.com/argoproj/argo
>
> tl;dr is that it's a workflow management tool where DAGs are expressed as
> YAML manifests, and tasks are just containers run on Kubernetes.
>
> I think that there's a lot of value in Airflow's use of Python rather than
> a YAML-based DSL. But I do think that containers are the future, and I'm
> hopeful that Airflow develops in the direction of focusing on being a
> principled Python framework for managing tasks/data executed in
> containers, and the resulting execution state.
>
>
> On Tue, Feb 26, 2019 at 8:55 PM Maxime Beauchemin <
> maximebeauchemin@gmail.com> wrote:
>
> > Related thoughts:
> >
> > * On the topic of serialization, let's be clear whether we're talking
> > about unidirectional serialization and *not* deserialization back to the
> > object. This works for making the web server stateless, but isn't a
> > solution for how DAG definitions get shipped around the cluster (which
> > would be nice to have from a system standpoint, but we'd have to break
> > lots of dynamic features, things like callbacks and attaching complex
> > objects to DAGs, ...).
> >
> > * Docker as "serialization" is interesting; I looked into the "pex"
> > format in the past. It's pretty cool to think of DAGs as micro Docker
> > applications that get shipped around and executed. The challenge with
> > this is that it makes it hard to control Airflow's core. Upgrading
> > Airflow becomes [also] about upgrading the DAG Docker images. We had
> > similar concerns with "pex". The data platform team loses their handle
> > on the core, or has to get into the Docker-building business, which is
> > atypical. For an upgrade, you'd have to ask/force the people who own the
> > DAG Docker images to upgrade them, else they won't run, or something.
> > The contract could be like "we'll only run your Airflow-docker-dag
> > container if it's in a certain version range" or something like that. I
> > think it's a cool idea. It gets intricate for the stateless web server
> > though; it's a bit of a mind bender :) You could ask the container to
> > render the page (isn't that crazy?!) or ask it for a serialized version
> > of the DAG that allows you to render the page (similar to point 1).
> >
> > * About storing in the db: for efficiency, the pk should be the SHA of
> > the deterministic serialized DAG. Only store a new entry if the DAG has
> > changed, and stamp the DagRun with a FK to that serialized DAG table (a
> > rough sketch of this idea follows below the list). If people have
> > shape-shifting DAGs within DagRuns we just do best effort, show them the
> > last one or something like that.
> >
> > * Everyone hates pickles (including me), but it really almost works; it
> > might be worth revisiting, or at least I think it's good for me to list
> > out the blockers:
> >   * JinjaTemplate objects are not serializable for some odd, obscure
> >     reason. I think the community can solve that easily; if someone
> >     wants a full brain dump on this I can share what I know.
> >   * Size: as you pickle something, someone might have attached things
> >     that recurse into a hundreds-of-GBs-size pickle. Like some
> >     on_failure_callback may bring in the whole Slack API library. That
> >     can be solved or mitigated in different ways. At some point I
> >     thought I'd have a DAG.validate() method that makes sure that the
> >     DAG can be pickled, and serialized to a reasonable-size pickle. I
> >     also think we'd have to make sure operators are defined as more
> >     "abstract", otherwise the pickle includes things like the whole
> >     PyHive lib and all sorts of other deps. It could be possible to
> >     limit what gets attached to the pickle (whitelist classes), and
> >     dehydrate objects during serialization and rehydrate them on the
> >     other side (assuming classes are on the worker too). If that sounds
> >     crazy to you, it's because it is.
> >
> > * The other crazy idea is thinking of the git repo (the code itself) as
> > the serialized DAG. There are git filesystems in userspace [FUSE] that
> > allow dynamically accessing the git history like it's just a folder, as
> > in `REPO/{ANY_GIT_REF}/dags/mydag.py`. Beautifully hacky. A company with
> > a blue logo with a big F on it that I used to work at did that. Talk
> > about embracing config-as-code! The DagRun can just stamp the git SHA
> > it's running with.
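A rough sketch of the content-addressed storage idea from the third bullet
above, using only the standard library; the "table" is stood in for by a
dict and all names are illustrative, not Airflow's actual schema:

    import hashlib
    import json

    def dag_fingerprint(serialized_dag: dict) -> str:
        # Deterministic serialization: sorted keys, fixed separators.
        canonical = json.dumps(serialized_dag, sort_keys=True,
                               separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    # Stand-in for the serialized-DAG table, keyed by the SHA primary key.
    serialized_dag_table = {}

    def store_if_changed(serialized_dag: dict) -> str:
        """Insert the serialized DAG under its SHA and return the key,
        which a DagRun row could then carry as a foreign key."""
        sha = dag_fingerprint(serialized_dag)
        if sha not in serialized_dag_table:
            serialized_dag_table[sha] = serialized_dag
        return sha

Two identical serializations map to the same row, so a new entry only
appears when the DAG actually changes.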
> >
> > Sorry about the confusion, config-as-code gets tricky around the
> > corners. But it's all worth it, right? Right!? :)
> >
> > On Tue, Feb 26, 2019 at 3:09 AM Kevin Yang wrote:
> >
> > > My bad, I was misunderstanding a bit and mixing up two issues. I was
> > > thinking about the multiple-runs-for-one-DagRun issue (e.g. after we
> > > clear the DagRun).
> > >
> > > This is an orthogonal issue, so the current implementation can work
> > > in the long-term plan.
> > >
> > > Cheers,
> > > Kevin Y
> > >
> > > On Tue, Feb 26, 2019 at 2:34 AM Ash Berlin-Taylor wrote:
> > >
> > > > On 26 Feb 2019, at 09:37, Kevin Yang wrote:
> > > >
> > > > > Now since we're already trying to have multiple graphs for one
> > > > > execution_date, maybe we should just have multiple DagRuns.
> > > >
> > > > I thought that there is exactly one graph for a DAG run - dag_run
> > > > has a "graph_id" column.