Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 17251200C61 for ; Tue, 25 Apr 2017 09:02:36 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id 159FB160BB3; Tue, 25 Apr 2017 07:02:36 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id 355D2160B9E for ; Tue, 25 Apr 2017 09:02:35 +0200 (CEST) Received: (qmail 84266 invoked by uid 500); 25 Apr 2017 07:02:34 -0000 Mailing-List: contact dev-help@airflow.incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@airflow.incubator.apache.org Delivered-To: mailing list dev@airflow.incubator.apache.org Received: (qmail 84253 invoked by uid 99); 25 Apr 2017 07:02:34 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd4-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 25 Apr 2017 07:02:34 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd4-us-west.apache.org (ASF Mail Server at spamd4-us-west.apache.org) with ESMTP id A2E96CCF82 for ; Tue, 25 Apr 2017 07:02:33 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd4-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: -0.121 X-Spam-Level: X-Spam-Status: No, score=-0.121 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_MSPIKE_H3=-0.01, RCVD_IN_MSPIKE_WL=-0.01, SPF_PASS=-0.001] autolearn=disabled Authentication-Results: spamd4-us-west.apache.org (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com Received: from mx1-lw-us.apache.org ([10.40.0.8]) by localhost (spamd4-us-west.apache.org [10.40.0.11]) (amavisd-new, port 10024) with ESMTP id PWvqgvYRI1I0 for ; Tue, 25 Apr 2017 07:02:32 +0000 (UTC) Received: from mail-wm0-f42.google.com (mail-wm0-f42.google.com [74.125.82.42]) by mx1-lw-us.apache.org (ASF Mail Server at mx1-lw-us.apache.org) with ESMTPS id E68D95FC91 for ; Tue, 25 Apr 2017 07:02:31 +0000 (UTC) Received: by mail-wm0-f42.google.com with SMTP id u65so14290412wmu.1 for ; Tue, 25 Apr 2017 00:02:31 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:content-transfer-encoding:mime-version:subject:date:references :to:in-reply-to:message-id; bh=BElTxvwu5UH6XCkw6TR2P45xwn+1ZtN7ov3zIpRSWsY=; b=NbG7AAYzId4TDwGSnDL4mDjPf64gIr08sldkgI6bSJ79EIk+4LVmK/tyvPfNKM638Q ed8qTvgrJf3kBCz6CijGtFpWhRctqE7rLf4ijmwKaXhxtrHK+3JKuVscb7wdRKj+Qg3l GR7s7E84IXtFN9+XHyGKi59yZvr7UV4u7HbIQwj4n/84TDPvOqBJy6bXhgNH05hWM5ol sMiLS6kNVYLdP4iZf+oSxRuYq4WuY91dw5DI9UCd7Ny1foUlqc/vfMeL9MnxM5gRd5JQ j2MTDRF0MEwbWNTfrsebk3OBzcJsPtgw9vpMZ5/C1WLFq/5AWoBP2bnIpOU4k2IqM72+ 89xQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:content-transfer-encoding:mime-version :subject:date:references:to:in-reply-to:message-id; bh=BElTxvwu5UH6XCkw6TR2P45xwn+1ZtN7ov3zIpRSWsY=; b=bUkIX6iImSRM8lQPH6y0J6dBlnwW+hi02edT4yoQD/NJXLHGB1lOWH/qPzMjGSVkHL USGATcVwn4Si6964SuBZl4hAwzpZ2kMT/3LoNzLDvMUmWIrJV6xQT9uqT9V8/SUcxELZ u4cJqBQ4ElnpKuTIKEcs/zbCMSXWVLH56pRqIbRib7yCPqXXql+GCTWyCbdku+7xLm5L UtTI5zOlxClQfgipzeDkIN3I5n9PceWGuqCZBv+mDnKMn38YO2mF4FWxkJqS4sLoSq9c JdtCq0sG0pNcOq1zrdIwDVh5kAtVUkmkysQb5HcCP5xZ/CDc/bw5vFFSmjjjYBvmQsfE +OJQ== X-Gm-Message-State: AN3rC/5Cubj0C/70K7eL76pqSl6dkTgi+TI/ONUq0neTqUXlMwEMhPW/ YvKJATieAullVsLzwzrwaQ== X-Received: by 10.80.170.151 with SMTP id q23mr3944314edc.72.1493103750552; Tue, 25 Apr 2017 00:02:30 -0700 (PDT) Received: from [10.254.254.3] (89.20.160.55.static.ef-service.nl. [89.20.160.55]) by smtp.gmail.com with ESMTPSA id m53sm4685919edc.29.2017.04.25.00.02.29 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 25 Apr 2017 00:02:29 -0700 (PDT) From: Bolke de Bruin Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Mime-Version: 1.0 (Mac OS X Mail 10.3 \(3273\)) Subject: Re: dag file processing times Date: Tue, 25 Apr 2017 09:01:56 +0200 References: <480E8A5A-5F84-4F45-A2AA-6D1467C5C904@gmail.com> To: dev@airflow.incubator.apache.org In-Reply-To: Message-Id: X-Mailer: Apple Mail (2.3273) archived-at: Tue, 25 Apr 2017 07:02:36 -0000 We could of course write a module loader that takes care of the caching = and maybe even the manifest. This would help with versioning and could = look a bit like the java class loader (by separating the imported = modules or making sure we always load the modules when loading dags). = Didn=E2=80=99t think about repercussions so there might be severe cons. = Please note that I don=E2=80=99t think the multiprocess processor solves = the sys.modules issue entirely: cached modules in the parent will still = be there, so any dependencies the airflow scheduler itself brings in = will be in the processor. It is probably enough in 99% of the = circumstances though. On the config issue I don=E2=80=99t entirely agree. If you have a config = that is available outside your dag, this will still be loaded if you do = not use serialisation. Strengthening my case for just sending DAGs (and = if needed dependencies) around and not use pickling/serialization (btw = the on the wire format of marshmallow is json). Bolke. > On 25 Apr 2017, at 01:09, Maxime Beauchemin = wrote: >=20 > With configuration as code, you can't really know whether the DAG > definition has changed based on whether the module was altered. This = python > module could be importing other modules that have been changed, could = have > read a config file somewhere on the drive that might have changed, or = read > from a DB that is constantly getting mutated. >=20 > There are also issues around the fact that Python caches modules in > `sys.modules`, so even though the crawler is re-interpreting modules, > imported modules wouldn't get re-interpreted [as our DAG authors = expected] >=20 > For these reasons [and others I won't get into here], we decided that = the > scheduler would use a subprocess pool and re-interpret the DAGs from > scratch at every cycle, insulating the different DAGs and guaranteeing = no > interpreter caching. >=20 > Side note: yaml parsing is much more expensive than other markup = languages > and would recommend working around it to store DAG configuration. Our > longest-to-parse DAGs at Airbnb were reading yaml to build build a = DAG, and > I believe someone wrote custom logic to avoid reparsing the yaml at = every > cycle. Parsing equivalent json or hocon was an order of magnitude = faster. >=20 > Max >=20 > On Mon, Apr 24, 2017 at 2:55 PM, Bolke de Bruin = wrote: >=20 >> Inotify can work without a daemon. Just fire a call to the API when a = file >> changes. Just a few lines in bash. >>=20 >> If you bundle you dependencies in a zip you should be fine with the = above. >> Or if we start using manifests that list the files that are needed in = a >> dag... >>=20 >>=20 >> Sent from my iPhone >>=20 >>> On 24 Apr 2017, at 22:46, Dan Davydov = >> wrote: >>>=20 >>> One idea to solve this is to use a daemon that uses inotify to watch = for >>> changes in files and then reprocesses just those files. The hard = part is >>> without any kind of dependency/build system for DAGs it can be hard = to >> tell >>> which DAGs depend on which files. >>>=20 >>> On Mon, Apr 24, 2017 at 1:21 PM, Gerard Toonstra = >>> wrote: >>>=20 >>>> Hey, >>>>=20 >>>> I've seen some people complain about DAG file processing times. An = issue >>>> was raised about this today: >>>>=20 >>>> https://issues.apache.org/jira/browse/AIRFLOW-1139 >>>>=20 >>>> I attempted to provide a good explanation what's going on. Feel = free to >>>> validate and comment. >>>>=20 >>>>=20 >>>> I'm noticing that the file processor is a bit naive in the way it >>>> reprocesses DAGs. It doesn't look at the DAG interval for example, = so it >>>> looks like it reprocesses all files continuously in one big batch, = even >> if >>>> we can determine that the next "schedule" for all its dags are in = the >>>> future? >>>>=20 >>>>=20 >>>> Wondering if a change in the DagFileProcessingManager could = optimize >> things >>>> a bit here. >>>>=20 >>>> In the part where it gets the simple_dags from a file it's = currently >>>> processing: >>>>=20 >>>> for simple_dag in processor.result: >>>> simple_dags.append(simple_dag) >>>>=20 >>>> the file_path is in the context and the simple_dags should be able = to >>>> provide the next interval date for each dag in the file. >>>>=20 >>>> The idea is to add files to a sorted deque by = "next_schedule_datetime" >> (the >>>> minimum next interval date), so that when we build the list >>>> "files_paths_to_queue", it can remove files that have dags that we = know >>>> won't have a new dagrun for a while. >>>>=20 >>>> One gotcha to resolve after that is to deal with files getting = updated >> with >>>> new dags or changed dag definitions and renames and different = interval >>>> schedules. >>>>=20 >>>> Worth a PR to glance over? >>>>=20 >>>> Rgds, >>>>=20 >>>> Gerard >>>>=20 >>=20