airflow-dev mailing list archives

From harish singh
Subject When to use pools?
Date Mon, 20 Jun 2016 21:46:03 GMT

We have been using Airflow for about 3 months now.

One pain point: during a backfill, suppose I have two tasks t1 and t2, both
downstream of t0, where only t1 has depends_on_past=True:
              t0 -> t1
              t0 -> t2

I find that t2, which has no past dependency, keeps getting scheduled.
As a result, t1 waits a long time before it gets scheduled.

I think this is a good use case for creating "pools" and allocating slots to
each pool.
I would also have to use priority_weight and adjust parallelism.
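
To make the idea concrete, here is a toy model in plain Python of how per-pool slots and priority_weight could interact. It is only a sketch and not Airflow's scheduler: the schedule() helper, the pool names t1_pool and t2_pool, and the slot counts are all hypothetical, chosen for illustration.

```python
from collections import defaultdict


def schedule(ready_tasks, pool_slots):
    """Toy model (not Airflow's scheduler): pick tasks for one round.

    Tasks are considered from highest to lowest priority_weight, and a task
    is picked only while its pool still has a free slot.
    """
    used = defaultdict(int)
    chosen = []
    # sorted() is stable, so tasks with equal weight keep their input order.
    for task in sorted(ready_tasks, key=lambda t: -t["priority_weight"]):
        pool = task["pool"]
        if used[pool] < pool_slots[pool]:
            used[pool] += 1
            chosen.append(task["id"])
    return chosen


# Hypothetical pools: t1 gets its own slot, t2 runs are capped at two.
pool_slots = {"t1_pool": 1, "t2_pool": 2}

ready = [
    {"id": "t2@day1", "pool": "t2_pool", "priority_weight": 1},
    {"id": "t2@day2", "pool": "t2_pool", "priority_weight": 1},
    {"id": "t2@day3", "pool": "t2_pool", "priority_weight": 1},
    {"id": "t1@day1", "pool": "t1_pool", "priority_weight": 10},
]

picked = schedule(ready, pool_slots)
print(picked)
```

In this model t1 is picked first because of its higher weight, and the cap on t2_pool keeps the t2 runs from taking every available slot.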

Is there a better way to handle this?

Also, in general, are there any examples on how to use pools?

I peeked into airflow/tests/operators/ and found the
snippet below:

session = airflow.settings.Session()
pool_1 = airflow.models.Pool(pool='test_pool_1', slots=1)

Why do we need a Session instance? Do we need to run something like the code
below to create a pool in code (in a script under my dags/ directory):

pool = (
    session.query(Pool)
    .filter(Pool.pool == 'AIRFLOW-205')
    .first())
if not pool:
    session.add(Pool(pool='AIRFLOW-205', slots=8))
    session.commit()
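
The check-then-add pattern above can be tried without Airflow. Pools are stored as rows in the metadata database, which is presumably why a session is involved. The sketch below swaps in the standard-library sqlite3 module for the SQLAlchemy session; the ensure_pool() helper and the two-column table layout are assumptions for illustration, not Airflow's actual schema or API.

```python
import sqlite3


def ensure_pool(conn, name, slots):
    """Insert a pool row only if none exists yet; return True if created."""
    row = conn.execute(
        "SELECT 1 FROM slot_pool WHERE pool = ?", (name,)
    ).fetchone()
    if row is None:
        conn.execute(
            "INSERT INTO slot_pool (pool, slots) VALUES (?, ?)", (name, slots)
        )
        conn.commit()  # without a commit the new row is not persisted
        return True
    return False


# In-memory stand-in for the metadata database (illustrative schema only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE slot_pool (pool TEXT PRIMARY KEY, slots INTEGER)")

first = ensure_pool(conn, "AIRFLOW-205", 8)   # creates the row
second = ensure_pool(conn, "AIRFLOW-205", 8)  # finds it, does nothing
count = conn.execute("SELECT COUNT(*) FROM slot_pool").fetchone()[0]
print(first, second, count)
```

Because the existence check runs first, calling the helper every time the script is parsed would not create duplicate rows.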

Also, I saw a few places where pool: 'backfill' is used.

Is 'backfill' a special pre-defined pool?

If not, how do we create different types of pools based on whether it
is backfill or not?

All of this is being done in a script under the 'dags/' directory.

