airflow-dev mailing list archives

From harish singh <harish.sing...@gmail.com>
Subject When to use pools?
Date Mon, 20 Jun 2016 21:46:03 GMT
Hi,

We have been using airflow for about 3 months now.

One pain point I hit: during backfill, if I have two tasks t1 and t2, with t1
having depends_on_past=True,
              t0 -> t1
              t0 -> t2

I find that t2, which has no past dependency, keeps getting scheduled.
This causes t1 to wait a long time before it gets scheduled.

I think this is a good use case for creating "pools" and allocating slots to
each pool.
I would also have to use priority_weight, and adjust parallelism!!!

Is there a better way to handle this?
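To make the behavior I'm after concrete, here is a plain-Python sketch (not
airflow code, and the function/task names are made up): a pool is just a fixed
number of slots, and priority_weight decides which queued task grabs a free
slot first, so t1 would jump ahead of t2.

```python
import heapq

def start_order(tasks, slots):
    """tasks: list of (priority_weight, name); higher weight starts first.
    Returns the order tasks would start, given `slots` concurrent slots."""
    # heapq is a min-heap, so negate the weight to pop the highest first
    queue = [(-weight, name) for weight, name in tasks]
    heapq.heapify(queue)
    order = []
    while queue:
        # each "tick", start up to `slots` of the highest-priority tasks
        batch = [heapq.heappop(queue)[1] for _ in range(min(slots, len(queue)))]
        order.extend(batch)
    return order

# t1 gets a higher priority_weight, so with 1 slot it starts before t2
print(start_order([(1, "t2"), (10, "t1")], slots=1))  # ['t1', 't2']
```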


Also, in general, are there any examples on how to use pools?

I peeked into airflow/tests/operators/subdag_operator.py and found the
below snippet:

import airflow

session = airflow.settings.Session()
pool_1 = airflow.models.Pool(pool='test_pool_1', slots=1)
session.add(pool_1)
session.commit()

Why do we need a Session instance? Do I need to run the code below before
creating a pool in code (inside my pipeline.py under the dags/ directory)?

from airflow.models import Pool

pool = (
    session.query(Pool)
    .filter(Pool.pool == 'AIRFLOW-205')
    .first())
if not pool:
    session.add(Pool(pool='AIRFLOW-205', slots=8))
    session.commit()


Also, I saw a few places where pool='backfill' is used.

Is 'backfill' a special pre-defined pool?


If not, how do we create different pools depending on whether a run is a
backfill or not?
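My guess for how this could work (again a plain-Python sketch, not airflow
API, with made-up pool names): two independently sized pools, with each task
tagged with the pool it draws a slot from, so backfill tasks can't starve
the regular ones.

```python
# Each pool has its own slot count; a task only competes for slots
# within its own pool. Names here are hypothetical.
pools = {"backfill_pool": 2, "default_pool": 4}

def slots_available(pool_name, running):
    """running: list of pool names for the currently-running tasks."""
    used = sum(1 for p in running if p == pool_name)
    return pools[pool_name] - used

running = ["backfill_pool", "backfill_pool", "default_pool"]
print(slots_available("backfill_pool", running))  # 0 - backfill pool is full
print(slots_available("default_pool", running))   # 3 - regular tasks unaffected
```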


All of this is being done in a pipeline.py script under the 'dags/' directory.


Thanks,
Harish
