From dev-return-5015-archive-asf-public=cust-asf.ponee.io@airflow.incubator.apache.org Sun May 6 09:13:32 2018 Return-Path: X-Original-To: archive-asf-public@cust-asf.ponee.io Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by mx-eu-01.ponee.io (Postfix) with SMTP id AC7AB180674 for ; Sun, 6 May 2018 09:13:31 +0200 (CEST) Received: (qmail 17615 invoked by uid 500); 6 May 2018 07:13:29 -0000 Mailing-List: contact dev-help@airflow.incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@airflow.incubator.apache.org Delivered-To: mailing list dev@airflow.incubator.apache.org Received: (qmail 17598 invoked by uid 99); 6 May 2018 07:13:29 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd3-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Sun, 06 May 2018 07:13:28 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd3-us-west.apache.org (ASF Mail Server at spamd3-us-west.apache.org) with ESMTP id 80B74180576 for ; Sun, 6 May 2018 07:13:28 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd3-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 1.998 X-Spam-Level: * X-Spam-Status: No, score=1.998 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, HTML_MESSAGE=2, RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_MSPIKE_H2=-0.001, SPF_PASS=-0.001] autolearn=disabled Authentication-Results: spamd3-us-west.apache.org (amavisd-new); dkim=pass (2048-bit key) header.d=gerar-do.20150623.gappssmtp.com Received: from mx1-lw-us.apache.org ([10.40.0.8]) by localhost (spamd3-us-west.apache.org [10.40.0.10]) (amavisd-new, port 10024) with ESMTP id puiKaRsqSEeQ for ; Sun, 6 May 2018 07:13:26 +0000 (UTC) Received: from mail-pf0-f178.google.com (mail-pf0-f178.google.com [209.85.192.178]) by mx1-lw-us.apache.org (ASF Mail Server at mx1-lw-us.apache.org) with ESMTPS id A13595F1F3 for ; Sun, 6 May 2018 07:13:25 +0000 (UTC) Received: by mail-pf0-f178.google.com with SMTP id c10so20446068pfi.12 for ; Sun, 06 May 2018 00:13:25 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gerar-do.20150623.gappssmtp.com; s=20150623; h=mime-version:in-reply-to:references:from:date:message-id:subject:to; bh=FT5gw/2ip/zx/oKTQtmxPCkfC/cK2eqerZ1pCW+ShP4=; b=YqDdbnGc0zZ/S1gBfge1aeGbGJWaAhMkeGNZeQ+G6xAPCuQumLAl+b3jr/TAvL89N6 boT3G6sj8NO1GsW9oObSntQZ/yvDKbCF5camRxsVVZ1OvL/RiaE5xEjH3TyHh6d64InU HDWXH+ayR2wY8yeXNcFW8dHt+jG8ANbt9kEnC0L335N5vTCIqTooBHS4PDP7PCK4jjbt AmkBplTkd7YJxajGnEMgyY4YYczhafOZxJpzONYkg8rM085M4OZbiV4VN9Pl60ev/HH4 T+0+EPpp9Avd3YIp0OrTsCeabuOrtJArM0lSAJKAx5QW7Y5i4KNq/2FZvxK9pG13IIHF iPLw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to; bh=FT5gw/2ip/zx/oKTQtmxPCkfC/cK2eqerZ1pCW+ShP4=; b=DZfAh84RZePSeOCjH0SvBDMKU/FiCCBwJqh9U+7iNrF+tQj/9E3WJ8ZrWNkWajCHxa 55tKP0D9Zoo2zfk8gONEJi6p6CyPGpoJu0NNMzaczHU8w7T7YUP8ehpDlyP8l02Pa+PC TbYkvTLmGAOLX046P2vc2aXqmZ1e77nPfHBCq50Px2blTsMkl28q4yaEHYe6iyerDhRf eNElAiuUlZKnaq8FGay1OSzwC71K+/m8wkdrNOtpk4dzk3iKUJKNjtKl2MaisFFkJc+r qydoOC/anQViNYDekvj/5MOoKi9Io1GKvA1/qzDohhedR+WYX69BVkSN+l2E7fQ+7FW+ jx2A== X-Gm-Message-State: ALQs6tBPllWRljzC1A+2bP+PIx+F0cHNnRw2ZaTYI+aPzVpbEO9XAUrm GrLK0DxzJLazXaOSSHnlicondyuCuhd4ig7EO4T0JrwM X-Google-Smtp-Source: AB8JxZoqiGB/jc3pc47Zhe+xRVZKLbLSx1saLbXb1Bs3JUVm0GXej21l8kZp6VSU2p/EOkUajCLWKUA89W+xNOq8HY4= X-Received: by 10.98.8.69 with SMTP id c66mr20824635pfd.189.1525590797613; Sun, 06 May 2018 00:13:17 -0700 (PDT) MIME-Version: 1.0 Received: by 10.100.170.138 with HTTP; Sun, 6 May 2018 00:13:17 -0700 (PDT) In-Reply-To: References: From: Gerardo Curiel Date: Sun, 6 May 2018 17:13:17 +1000 Message-ID: Subject: Re: Lineage To: dev@airflow.incubator.apache.org Content-Type: multipart/alternative; boundary="001a1143d76a908dee056b8449e1" --001a1143d76a908dee056b8449e1 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Hi Bolke, Data lineage support sounds very interesting. I'm not very familiar with Atlas but first sight seems like a tool specific to the Hadoop ecosystem. How would this look like if the files (inlets or outlets) were stored on s3?. An example of a service that manages a similar use case is AWS Glue[1], which creates a hive metastore based on the schema and other metadata it can get from different sources (amongst them, s3 files). On Sun, May 6, 2018 at 7:49 AM, Bolke de Bruin wrote: > Hi All, > > I have made a first implementation that allows tracking of lineage in > Airflow and integration with Apache Atlas. It was inspired by Jeremiah=E2= =80=99s > work in the past on Data Flow pipelines, but I think I kept it a little b= it > simpler. > > Operators now have two new parameters called =E2=80=9Cinlets=E2=80=9D and= =E2=80=9Coutlets=E2=80=9D. These > can be filled with objects derived from =E2=80=9CDataSet=E2=80=9D, like = =E2=80=9CFile=E2=80=9D and > =E2=80=9CHadoopFile=E2=80=9D. Parameters are jinja2 templated, which > means they receive the context of the task when it is running and get > rendered. So you can get definitions like this: > > f_final =3D File(name=3D"/tmp/final") > run_this_last =3D DummyOperator(task_id=3D'run_this_last', dag=3Ddag, > inlets=3D{"auto": True}, > outlets=3D{"datasets": [f_final,]}) > > f_in =3D File(name=3D"/tmp/whole_directory/") > outlets =3D [] > for file in FILE_CATEGORIES: > f_out =3D File(name=3D"/tmp/{}/{{{{ execution_date }}}}".format(file)= ) > outlets.append(f_out) > run_this =3D BashOperator( > task_id=3D'run_after_loop', bash_command=3D'echo 1', dag=3Ddag, > inlets=3D{"auto": False, "task_ids": [], "datasets": [f_in,]}, > outlets=3D{"datasets": outlets} > ) > run_this.set_downstream(run_this_last) > > So I am trying to keep to boilerplate work down for developers. Operators > can also extend inlets and outlets automatically. This will probably be a > bit harder for the BashOperator without some special magic, but an update > to the DruidOperator can be relatively quite straightforward. > > In the future Operators can take advantage of the inlet/outlet definition= s > as they are also made available as part of the context for templating (as > =E2=80=9Cinlets=E2=80=9D and =E2=80=9Coutlets=E2=80=9D). > > I=E2=80=99m looking forward to your comments! > > https://github.com/apache/incubator-airflow/pull/3321 > > Bolke. [1] https://aws.amazon.com/glue/ Cheers, --=20 Gerardo Curiel // https://gerar.do --001a1143d76a908dee056b8449e1--