From: Gerard Toonstra
Date: Fri, 11 May 2018 14:23:22 +0200
Subject: Processing of files: best practices for airflow
To: dev@airflow.incubator.apache.org

Hi all,

I have a question regarding the processing of individual files:

We collect flat files from different sources in CSV, raw and unstructured formats. These files are stored in a "{process}/YYYY/MM/DD/" hierarchy, and we've built a GCSToGCSTransform operator, which runs a download/transform/upload loop on each file in the directory.

This works OK, but I get the impression that the DAG is getting a bit messy as a result, and because the logic is contained in each DAG, I see very little potential for code reuse.

We have some suggestions, which include writing libraries and callable script files so that the functionality can be shared across multiple DAGs. I can also imagine that some people write Docker containers for this and run them in the cloud, telling each container where to get the files and where to put the results.

So I'm wondering whether anyone has found effective ways to deal with this, and what is considered best practice for Airflow?

Rgds,

Gerard
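
To make the "library" suggestion above a bit more concrete, here is a minimal sketch of how the download/transform/upload loop could live in a shared module instead of inside each DAG. Everything in it is an assumption for illustration: the Storage interface, InMemoryStorage, day_prefix, transform_prefix and uppercase are hypothetical names, not part of Airflow or of the GCSToGCSTransform operator mentioned above, and the in-memory store only stands in for a real bucket so the example runs without cloud access.

```python
from datetime import date
from typing import Callable, Dict, List, Protocol


class Storage(Protocol):
    """Hypothetical minimal storage interface; a GCS-backed class could implement it."""

    def list_names(self, prefix: str) -> List[str]: ...
    def download(self, name: str) -> bytes: ...
    def upload(self, name: str, data: bytes) -> None: ...


class InMemoryStorage:
    """Stand-in for a real bucket (assumption, for demonstration only)."""

    def __init__(self, objects: Dict[str, bytes]) -> None:
        self.objects = dict(objects)

    def list_names(self, prefix: str) -> List[str]:
        # Return object names under the prefix in a stable order.
        return sorted(n for n in self.objects if n.startswith(prefix))

    def download(self, name: str) -> bytes:
        return self.objects[name]

    def upload(self, name: str, data: bytes) -> None:
        self.objects[name] = data


def day_prefix(process: str, day: date) -> str:
    """Build a "{process}/YYYY/MM/DD/" prefix for the given day."""
    return f"{process}/{day:%Y/%m/%d}/"


def transform_prefix(
    storage: Storage,
    src_prefix: str,
    dst_prefix: str,
    transform: Callable[[bytes], bytes],
) -> List[str]:
    """Download, transform and upload every object under src_prefix.

    Returns the names of the objects that were written.
    """
    written = []
    for name in storage.list_names(src_prefix):
        data = storage.download(name)
        # Keep the file name, but place the result under dst_prefix.
        target = dst_prefix + name[len(src_prefix):]
        storage.upload(target, transform(data))
        written.append(target)
    return written


def uppercase(data: bytes) -> bytes:
    """Toy transform used only to show where per-process logic plugs in."""
    return data.upper()


if __name__ == "__main__":
    store = InMemoryStorage({"sales/2018/05/11/a.csv": b"x,y"})
    src = day_prefix("sales", date(2018, 5, 11))
    print(transform_prefix(store, src, "clean/" + src, uppercase))
```

With something along these lines, each DAG would only need to supply the prefixes and the transform function, for example by passing a small wrapper as the python_callable of a PythonOperator, while the loop itself is maintained in one place. Whether this or a container per transform is the better fit is exactly the open question of this thread.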