From: Dan Davydov
Date: Fri, 1 Feb 2019 13:05:42 -0500
Subject: Re: AIP-12 Persist DAG into DB
To: dev@airflow.apache.org

@Max What I've been thinking about recently is creating an abstraction for
the serialization process. I think it makes sense, e.g. for dynamic DAGs, to
have a service that periodically serializes DAGs and uploads them to e.g. a
database via some new Airflow DAG Uploader Service. There should be proper
support for authentication models for this DB/wrapping service.
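(A rough sketch of the uploader idea above, for concreteness. Everything here is hypothetical — `serialize_dag`, `upload_dags`, and the `serialized_dag` table are illustrative names, not Airflow APIs; sqlite3 stands in for the metadata DB.)

```python
# Hypothetical "DAG Uploader Service" loop: reduce each DAG to plain JSON
# (similar in spirit to the SimpleDAGBag idea later in this thread) and
# upsert it into a DB table that the webserver/scheduler can read.
import json
import sqlite3
import time


def serialize_dag(dag_id, tasks, edges):
    """Reduce a DAG to a plain-JSON structure."""
    return json.dumps({
        "dag_id": dag_id,
        "tasks": tasks,   # e.g. [{"task_id": "t1"}, {"task_id": "t2"}]
        "edges": edges,   # e.g. [["t1", "t2"]]
        "serialized_at": time.time(),
    })


def upload_dags(conn, dags):
    """Upsert serialized DAGs; last write wins per dag_id."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS serialized_dag "
        "(dag_id TEXT PRIMARY KEY, data TEXT NOT NULL)"
    )
    for dag_id, tasks, edges in dags:
        conn.execute(
            "INSERT OR REPLACE INTO serialized_dag (dag_id, data) VALUES (?, ?)",
            (dag_id, serialize_dag(dag_id, tasks, edges)),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
upload_dags(
    conn,
    [("example_dag", [{"task_id": "t1"}, {"task_id": "t2"}], [["t1", "t2"]])],
)
row = conn.execute(
    "SELECT data FROM serialized_dag WHERE dag_id = ?", ("example_dag",)
).fetchone()
print(json.loads(row[0])["dag_id"])  # prints: example_dag
```

An auth layer in front of this table (or a wrapping service) is what would gate who may upload which DAGs.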
You would also potentially get the ability for users to submit ad-hoc DAGs
to the production server this way too (instead of needing a custom devel
instance).

On Fri, Feb 1, 2019 at 12:43 PM Ben Tallman wrote:

> In my experience, there are two major wins to chase here. Neither is
> simple, nor is this the first discussion around them. In the past there was
> an attempt to use pickling to handle these challenges.
>
> The first is that with dynamic DAGs (they are evaluated as Python code
> after all), it is possible that each DagRun of a DAG is different, either
> slightly or completely. This is a very powerful concept, but currently it
> basically breaks, as the DAG itself is re-evaluated every time it is used,
> and therefore needs to be quite stable during a DagRun. I believe it would
> be a huge win if the DagRun itself were stable from the time it starts
> until its completion, across the whole cluster, and then even into the
> history of runs in the webserver.
>
> The second win to chase is a bit different, and deals with the history of
> DagRuns. Specifically, what happens to history (logs, results, etc.) when a
> DAG is re-run, either because of an error that has been corrected, or
> because the user has changed the DAG and decides to backfill? In that case,
> I believe that being able to see the history of a DAG's runs on a particular
> schedule is hugely valuable, both for retaining history (chain of
> custody/audit-like reasons) and for seeing change over time and tracking
> statistics.
>
> Just my few cents.
>
> Thanks,
> Ben
>
> --
> Ben Tallman - 503.680.5709
>
>
> On Thu, Jan 31, 2019 at 10:12 PM Maxime Beauchemin <
> maximebeauchemin@gmail.com> wrote:
>
> > Right, it's been discussed extensively in the past, and the main thing
> > needed to get to a "stateless webserver" (or at least a DagBag-free web
> > server) is to drop the template rendering in the UI.
> > Also we might need
> > little workarounds (we'd have to dig in to check) around deleting task
> > instances or force-running tasks; nothing major, I think.
> >
> > Also, the scheduler (think of it as a "supervisor", as this specific
> > workload has nothing to do with scheduling) would need to serialize the
> > DAGs periodically, likely to the database, so that the webserver can get
> > freshly serialized metadata from the database during the scope of web
> > requests.
> >
> > Max
> >
> > On Thu, Jan 31, 2019 at 9:28 AM Dan Davydov
> > wrote:
> >
> > > Agreed on complexities (I think deprecating Jinja templates for
> > > webserver rendering is one thing), but I'm not sure I understand the
> > > falling-down-on-code-changes part; mind providing an example?
> > >
> > > On Thu, Jan 31, 2019 at 12:22 PM Ash Berlin-Taylor
> > > wrote:
> > >
> > > > That sounds like a good idea at first, but falls down with possible
> > > > code changes in operators between one task and the next.
> > > >
> > > > (I would like this, but there are definite complexities)
> > > >
> > > > -ash
> > > >
> > > >
> > > > On 31 January 2019 16:56:54 GMT, Dan Davydov
> > > > wrote:
> > > > >I feel the right higher-level solution to this problem (which is
> > > > >"Adding Consistency to Airflow") is DAG serialization; that is, all
> > > > >DAGs should be represented as e.g. JSON (similar to the current
> > > > >SimpleDAGBag object used by the Scheduler). This solves the
> > > > >webserver issue, and also adds consistency between Scheduler and
> > > > >Workers (all DagRuns can be ensured to run at the same version of a
> > > > >DAG instead of whatever happens to live on the worker at the time).
> > > > >
> > > > >On Thu, Jan 31, 2019 at 9:44 AM Peter van 't Hof <
> > > > >petervanthof@godatadriven.com> wrote:
> > > > >
> > > > >> Hi All,
> > > > >>
> > > > >> As most of you know, Airflow has an issue when loading new DAGs:
> > > > >> the webserver sometimes sees them and sometimes not.
> > > > >> Because of this, we wrote this AIP to solve the issue:
> > > > >>
> > > > >> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-12+Persist+DAG+into+DB
> > > > >>
> > > > >> Any feedback is welcome.
> > > > >>
> > > > >> Gr,
> > > > >> Peter van 't Hof
> > > > >> Big Data Engineer
> > > > >>
> > > > >> GoDataDriven
> > > > >> Wibautstraat 202
> > > > >> 1091 GS Amsterdam
> > > > >> https://godatadriven.com
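(To make the "stateless webserver" side of this thread concrete: instead of holding a DagBag in memory, each web request would fetch the freshest serialized DAG from the database. This is a hedged sketch — `get_dag_for_request` and the `serialized_dag` table are hypothetical names, not Airflow code, and sqlite3 stands in for the metadata DB.)

```python
# Per-request read path for a DagBag-free webserver: look up the serialized
# JSON for a dag_id within the scope of one request, so the view always
# reflects whatever the scheduler/"supervisor" most recently persisted.
import json
import sqlite3


def get_dag_for_request(conn, dag_id):
    """Fetch the freshest serialized DAG for one web request."""
    row = conn.execute(
        "SELECT data FROM serialized_dag WHERE dag_id = ?", (dag_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE serialized_dag (dag_id TEXT PRIMARY KEY, data TEXT)")
conn.execute(
    "INSERT INTO serialized_dag VALUES (?, ?)",
    ("example_dag", json.dumps({"dag_id": "example_dag", "tasks": ["t1", "t2"]})),
)
conn.commit()

dag = get_dag_for_request(conn, "example_dag")
print(dag["tasks"])  # prints: ['t1', 't2']
```

Pinning a DagRun to one row version at start time would likewise give the run-level stability Ben describes, though that schema is not shown here.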