airflow-dev mailing list archives

From Andy Huynh <ahu...@symphonyrm.com>
Subject Re: Ignore Processing DAG Definition Python Files for Paused DAGs
Date Mon, 27 Nov 2017 23:29:53 GMT
After we upgraded to Airflow 1.9, we noticed a fairly large delay between tasks
(roughly 2 to 4 minutes, even after tuning the min_file_process_interval and
max_threads configs). Our thinking was that if we reduce the number of files the
scheduler has to process, it will get back to the files for unpaused DAGs more
frequently, which should reduce the delay between tasks.
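
To make the idea from the original proposal (quoted below) concrete, here is a rough, standalone sketch of the filtering step. It is not the actual patch and does not use Airflow's internals: it runs against an in-memory SQLite table that only mimics the fileloc, is_paused and is_active columns of the dag table, and the function names and the skip_paused flag are hypothetical. It also makes one assumption beyond the proposal: a file is only skipped when none of its active DAGs are unpaused, so a file that defines both a paused and an unpaused DAG is still processed.

```python
import sqlite3


def paused_only_file_paths(conn):
    """Return file paths whose active DAGs are all paused.

    Assumes a `dag` table with fileloc, is_paused and is_active columns
    (a stand-in for the real metadata table, not Airflow's own model).
    """
    paused = {
        row[0]
        for row in conn.execute(
            "SELECT DISTINCT fileloc FROM dag "
            "WHERE is_paused = 1 AND is_active = 1"
        )
    }
    # Assumption beyond the proposal: keep files that also hold an
    # unpaused active DAG, so that DAG is not starved.
    unpaused = {
        row[0]
        for row in conn.execute(
            "SELECT DISTINCT fileloc FROM dag "
            "WHERE is_paused = 0 AND is_active = 1"
        )
    }
    return paused - unpaused


def filter_known_file_paths(known_file_paths, conn, skip_paused=True):
    """Drop files for paused DAGs from the list the scheduler would process.

    `skip_paused` stands in for the proposed scheduler config switch
    (hypothetical name).
    """
    if not skip_paused:
        return list(known_file_paths)
    skip = paused_only_file_paths(conn)
    # Files with no row in the table (e.g. brand-new files) are kept.
    return [path for path in known_file_paths if path not in skip]


def make_demo_db():
    """Build a tiny in-memory table with made-up rows for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dag (dag_id TEXT PRIMARY KEY, fileloc TEXT, "
        "is_paused INTEGER, is_active INTEGER)"
    )
    conn.executemany(
        "INSERT INTO dag VALUES (?, ?, ?, ?)",
        [
            ("etl_daily", "/dags/etl.py", 0, 1),
            ("report_old", "/dags/report.py", 1, 1),
            ("mixed_a", "/dags/mixed.py", 1, 1),
            ("mixed_b", "/dags/mixed.py", 0, 1),
        ],
    )
    return conn


if __name__ == "__main__":
    known = ["/dags/etl.py", "/dags/report.py", "/dags/mixed.py", "/dags/new.py"]
    print(filter_known_file_paths(known, make_demo_db()))
```

With the made-up rows above, only /dags/report.py is dropped; turning skip_paused off returns the list unchanged, which corresponds to the current behavior.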

On 2017-11-27 11:23, Alek Storm <alek.storm@gmail.com> wrote: 
> What's the advantage of this change? Performance?
> 
> Alek
> 
> On Mon, Nov 27, 2017 at 1:11 PM, ahuynh@symphonyrm.com <
> ahuynh@symphonyrm.com> wrote:
> 
> > Hi all,
> >
> > I wanted to gauge community interest in this idea we have. We are
> > currently running a modified version of Airflow 1.9 RC3 where we ignore
> > processing DAG definition Python files for paused DAGs. By default,
> > list_py_file_paths traverses the dags subdirectory to look for Python
> > files, and the scheduler processes all these files, regardless of whether
> > the DAGs defined in these files are paused or not. Our proposed
> > modification was to query the fileloc column in the dag table, filtering
> > on is_paused=1 and is_active=1 to get a list of file paths for paused DAGs.
> > Then, we can exclude these files from the known_file_paths, so that the
> > scheduler does not process these files. This feature can be set on and off
> > via a scheduler config variable.
> >
> > If anyone is interested, we already have the code written, so we'd be
> > happy to package up our changes and create a PR.
> >
> > Thanks!
> > -Andy
> >
> 