From: Alex Guziel
Date: Mon, 24 Sep 2018 15:19:50 -0700
Subject: Re: Fundamental change - Separate DAG name and id.
To: dev@airflow.incubator.apache.org
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="000000000000f825f70576a56352"

--000000000000f825f70576a56352
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 8bit

I think decoupling dag_id and display name could be confusing and
cumbersome. As for readme, DAG already has a field called description
which I think is close to what Alex is describing (I believe it is
displayed by the UI).

On Mon, Sep 24, 2018 at 3:12 PM Alex Tronchin-James 949-412-7220 <
alex.n.james@gmail.com> wrote:

> Re: [Brian Greene] "How does filename matter?
> Frankly I wish the filename
> was REQUIRED to be the dag name so people would quit confusing themselves
> by mismatching them !"
>
> FWIW in the Facebook predecessor to airflow, the file path/name WAS the dag
> name. E.g. if your dag resided in best_team/new_project/sweet_dag.py then
> the dag name would be best_team.new_project.sweet_dag
> All tasks were identified by their variable name after that prefix: E.g. if
> best_team.new_project.sweet_dag defines an operator in a variable named
> task1, then the respective task_id is
> best_team.new_project.sweet_dag.task1.
>
> Airflow provides additional flexibility to specify DAG and task names to
> avoid the sometimes annoyingly long task names this resulted in and allow
> DAG/task names without forcing a code directory structure and python's
> variable naming restrictions, and I think this is a Good Thing.
>
> It seems like airflowuser is trying to provide additional metadata beyond
> the DAG/task names (so far, a DAG 'title' distinct from the ID). I've
> provided this through a README.md included in the DAG source directory, but
> maybe it would be a win to instead add a DAG parameter named 'readme' of
> string type which can include a docstring or even markdown to provide any
> desired additional metadata? This could then be displayed by the UI to
> simplify access to any such provided DAG documentation.
>
> 🍿
>
> On Thu, Sep 20, 2018 at 10:45 PM Brian Greene <
> brian@heisenbergwoodworking.com> wrote:
>
> > Prior to using airflow for much, on first inspection, I think I may have
> > agreed with you.
> >
> > After a bit of use I’d agree with Fokko and others - this isn’t really a
> > problem, and separating them seems to do more harm than good related to
> > deployment.
> >
> > I was gonna stop there, but why?
> >
> > You can add a task to a dag that’s deployed and has run and still view
> > history.
> > The “new” task shows up as white squares in the old dags. Nobody
> > said you’re required to also rename the dag when you do this. If your
> > process or desire or design determines you need to rename it, well then
> > by definition... isn’t it a new thing without a history? Airflow is
> > implementing exactly that.
> >
> > One could argue that renaming to reflect exact purpose is good practice.
> > Yes, I’d agree, but again following that logic if it’s a small enough
> > change to “slip in” then the name likely shouldn’t change. If it’s big
> > enough I want to change the name then it’s a big enough change that I’m
> > functionally running something “new”, and I expect to need to account for
> > that. Airflow is enforcing that logic by coupling the name to the
> > deployment of what you said was a new process.
> >
> > One might put forth that changing the name to be more descriptive in the
> > UI makes it easier for support staff. I think perhaps if that’s your
> > challenge it’s not airflow that’s a problem. Dags are of course
> > documented elsewhere besides their name, right? Yeah it’s self
> > documenting (and the graphs are cool), but I have to assume there’s
> > something besides the NAME to tell people what it does. Additionally, far
> > more than the name is required for even an operator or monitor watcher to
> > take action - you don’t expect them to know which tasks to rerun or how
> > to troubleshoot failures just based on your “now most descriptive name in
> > the UI” do you?
> >
> > I spent time in an Informatica shop where all the jobs were numbered.
> > Numbered. Let’s be more exact... their NAMES were NUMBERS like 56709.
> > Terrible, but 100% worked, because while a descriptive name would have
> > been useful, the name is the thing that’s supposed to NOT CHANGE (see
> > code of Abibarshim), and all the other information can attach to that in
> > places where you write... other information. People would curse a number
> > “F’ing 6291 failed again” - everyone knew what they were talking about..
> > I digress.
> >
> > You might decide to document “dag ID 12” or just “12” on your wiki - I’m
> > going to document “daily_sales_import”. And when things start failing at
> > 3am it’s not my dag “56” that’s failing, it’s the sales_export dag. But
> > if you document “12”, that’s still its name, and it’d better be 12 in all
> > your environments and documents. This also means the actual db IDs from
> > your proposal are almost certainly NOT the same across your environments,
> > making the 12 unchangeable name!
> >
> > There are lots of languages (most of them) where the name of a thing is
> > important and hard to change. It’s not a bad thing, and I’d assume that
> > deploying a thing by name has some significance in many systems. Go
> > rename a class in... pick a language... tell me how that should be easier
> > to do willy-nilly so it’s easier in the UI.
> >
> > I suppose you could view it as a limitation, but I don’t think you’ve
> > illuminated a single use case where it’s an actual technical constraint
> > or limitation.
> >
> > The BEST argument against the current implementation is db performance.
> > It’s a hogwash argument.
> > Basic key indexes on low cardinality string
> > columns are plenty fast for the airflow workload, and if your task load
> > is so high airflow can’t keep up or you’re seeing super-fast tasks and
> > airflow db/tracking latency is too much... perhaps a messaging or queue
> > processing solution is better suited to those workloads. We see scheduler
> > bottlenecks long before the database for our “quick task” scenarios.
> > Additionally, reading through this list you’ll find people running
> > airflow at substantial scale - I’ve not seen anyone complaining of
> > production performance issues based on this design decision. At first I
> > hated it. String keys are dirty, we’re all taught that as good little
> > programmers. Except when performance won’t be a huge consideration since
> > it’s not OLTP and ease of queryability is more important because it’s a
> > growing system... good decision - whoever made it.
> >
> > How does filename matter? Frankly I wish the filename was REQUIRED to be
> > the dag name so people would quit confusing themselves by mismatching
> > them ! We’ve renamed dag files with no issue as long as the content
> > doesn’t change, so again, not a real use case. And really - name your
> > stuff carefully before you get to prod man.
> >
> > I gotta ask - airflowuser - are you gonna use airflow for anything, or
> > just poke it with a stick from a distance and ask semi-inane questions of
> > these fine folks that wrote and spend time working on this cool piece of
> > kit?
> >
> > B
> >
> > Sent from a device with less than stellar autocorrect
> >
> > > On Sep 20, 2018, at 3:12 PM, Driesprong, Fokko wrote:
> > >
> > > I like the dag_id for both the name and as an unique identifier. If you
> > > change the dag in such a way, that it deserves a new name, you probably
> > > want to create a new dag anyway.
> > > If you want to give some additional
> > > context, you can use the description field:
> > >
> > > https://github.com/apache/incubator-airflow/blob/master/airflow/models.py#L3131-L3132
> > >
> > > The name of the file of dag does not have any influence.
> > >
> > > My 2¢
> > >
> > > Cheers, Fokko
> > >
> > > On Thu, 20 Sep 2018 at 19:40, James Meickle wrote:
> > >
> > >> I'm personally against having some kind of auto-increment numeric ID
> > >> for DAGs. While this makes a lot of sense for systems where creation
> > >> is a database activity (like a POST request), in Airflow, DAG creation
> > >> is actually a code ship activity. There are all kinds of complex
> > >> scenarios around that:
> > >>
> > >> - I revert a commit and a DAG disappears or is renamed
> > >> - I run the same file, twice, with multiple parameters to create two
> > >>   DAGs
> > >> - I create the DAG in both staging and prod, but they wind up with
> > >>   different IDs
> > >>
> > >> It's just too hard to automatically track these scenarios.
> > >>
> > >> If we really wanted to put something like this in place, it would
> > >> first make more sense to decouple DAG creation from code shipping, and
> > >> instead prefer creation of a DAG outside of code (but with a
> > >> definition that references which git
> > >> repo/committish/file/arguments/etc. to use). Then if you do something
> > >> like rename a file, the DAG breaks, but at least still exists in the
> > >> db with that ID and history still makes sense once you update the DAG
> > >> definition with the new code location.
> > >>
> > >> On Thu, Sep 20, 2018 at 4:52 AM airflowuser wrote:
> > >>
> > >>> Hi,
> > >>> though this could have been explained on Jira I think this should be
> > >>> discussed first.
> > >>>
> > >>> The problem:
> > >>> Airflow mixes DAG name with id. It uses same field for both purposes.
> > >>>
> > >>> I assume that most of you use the dag_id to describe what the DAG
> > >>> actually does.
> > >>> For example:
> > >>>
> > >>> dag = DAG(
> > >>>     dag_id='cost_report_daily',
> > >>>     ...
> > >>> )
> > >>>
> > >>> This dag_id is reflected to the dag id column in the UI.
> > >>> Now, let's say that you want to add another task to this specific
> > >>> dag - You are to be extremely careful when you change the dag_id to
> > >>> represent the new functionality, for example:
> > >>> dag_id='cost_expenses_reports_daily'. This will break the history of
> > >>> the DAG.
> > >>>
> > >>> Or even with simpler use case.. the user just wants to change the
> > >>> name he sees on the UI.
> > >>>
> > >>> I suggest to have a discussion if the dag_id should be split into id
> > >>> (an actual id) and name to reflect what it does. When the
> > >>> "connection" is done by id's - names can change as much as you want
> > >>> without breaking anything. Essentially it becomes a field used for
> > >>> display purpose only.
> > >>>
> > >>> * I didn't mention also the issue of DAG file name which can also
> > >>> cause trouble if someone wants to change it.
> > >>>
> > >>> Sent with [ProtonMail](https://protonmail.com) Secure Email.
> > >>
> > >

--000000000000f825f70576a56352--