airflow-dev mailing list archives

From "Driesprong, Fokko" <fo...@driesprong.frl>
Subject Re: [Discuss] Airflow sensor optimization
Date Wed, 06 Mar 2019 19:23:45 GMT
Thanks for bringing this up. I've added a comment on the Wiki:
https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-17+Airflow+sensor+optimization

Have you looked into the work by Seelmann? Recently he introduced the
ability to reschedule sensors. When rescheduling, the slot will be given
back to the scheduler after a poke operation. Therefore the slot won't be
occupied all the time. The details are in the PR
https://github.com/apache/airflow/pull/3596
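
To make the slot argument concrete, here is a toy calculation in plain Python. It is not Airflow code: the function name, the mode strings as used here, and the one-second poke duration are assumptions made for illustration only. It compares how long a worker slot is held when a sensor keeps its slot between pokes versus when it gives the slot back after each poke.

```python
def occupied_slot_seconds(mode, poke_times, poke_duration=1.0):
    """Return how many seconds a worker slot is held until the sensor succeeds.

    poke_times: start times (in seconds) of each poke; the last poke succeeds.
    poke_duration: assumed time a single poke takes (hypothetical value).
    """
    if not poke_times:
        return 0.0
    if mode == "poke":
        # The slot is held from the first poke until the last poke finishes,
        # including all the time spent sleeping in between.
        return (poke_times[-1] - poke_times[0]) + poke_duration
    if mode == "reschedule":
        # The slot is held only while a poke runs, then handed back.
        return len(poke_times) * poke_duration
    raise ValueError("unknown mode: %r" % mode)


# Four pokes, one minute apart; the fourth one succeeds.
pokes = [0, 60, 120, 180]
held_blocking = occupied_slot_seconds("poke", pokes)
held_rescheduled = occupied_slot_seconds("reschedule", pokes)
```

Under these assumptions the blocking sensor holds its slot for about three minutes, while the rescheduled sensor holds it only for the four short pokes.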

I would propose to make this the default behavior in Airflow 2.0.

Cheers, Fokko

On Wed, 6 Mar 2019 at 15:32, Yingbo Wang <ybwang@gmail.com> wrote:

> hi,
>
> I would like to open an AIP for Airflow sensor optimization.
>
>
> *Motivation*:
>
> Low efficiency in Airflow Sensor Implementation
>
> Sensors are a special kind of operator that will keep running until a
> certain criterion is met. Examples include a specific file landing in HDFS
> or S3, a partition appearing in Hive, or a specific time of the day.
> Sensors are derived from BaseSensorOperator and run a poke method at a
> specified poke_interval until it returns True.
>
> Sensor tasks are inefficient because, in the current design, we spawn a
> separate worker process for each partition sensor. This worker might last
> a long time, until the target partition is available. In the case where
> there are many sensor tasks that need to run within certain time limits,
> we have to allocate a lot of resources to have enough workers for the
> sensor tasks.
>
> *Idea:*
>
> We propose two approaches that could address this issue: batch-sensor
> and smart-sensor.
>
>
>
> Batch-sensor
>
> The basic idea of batch-sensor is to process sensor tasks in batches to
> save resources. While running, a batch-sensor takes N partition sensor
> requests as input and pokes those N partitions periodically. If the
> batch-sensor finds that the criterion of a sensor task is met, it
> updates the database entry for that sensor task.
>
>
> To do this, we need to create a sensor base class called ‘batchable’ and
> make all sensors inherit from this base class. We also need to change the
> behavior of the scheduler regarding batchable sensor tasks. The scheduler
> will find as many batchable sensor tasks as possible and run those tasks
> in a batch.
>
>
> Smart-sensor
>
> Smart-sensor is an improvement on top of batch-sensor.
>
> The idea of smart-sensor is that the worker process of smart-sensor will
> run like a service. To do this, we need to persist sensor details in the
> Airflow DB. The worker process periodically queries the task-instance
> table to find sensor tasks, pokes the metastore, and updates the
> task-instance table if it detects that a certain partition or file has
> been created.
>
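
The batch-sensor idea quoted above can be sketched in plain Python. This is only an illustration under stated assumptions, not the AIP's design and not Airflow code: `poke_batch`, `run_batch_sensor`, `mark_success` and `max_rounds` are hypothetical names, and the database update is stood in for by a callback.

```python
from typing import Callable, Dict, Set


def poke_batch(pending: Dict[str, Callable[[], bool]]) -> Set[str]:
    """Poke every pending sensor once; return the ids whose criterion is met."""
    return {task_id for task_id, poke in pending.items() if poke()}


def run_batch_sensor(
    pending: Dict[str, Callable[[], bool]],
    mark_success: Callable[[str], None],
    max_rounds: int,
    sleep: Callable[[], None] = lambda: None,
) -> Set[str]:
    """Poke N sensors from a single process for up to max_rounds rounds.

    mark_success stands in for "update the database about this sensor task".
    Returns the ids of the sensors that are still waiting.
    """
    remaining = dict(pending)
    for _ in range(max_rounds):
        for task_id in sorted(poke_batch(remaining)):
            mark_success(task_id)
            del remaining[task_id]
        if not remaining:
            break
        sleep()  # hypothetical stand-in for waiting one poke interval
    return set(remaining)


# Usage: two partition sensors, only one partition has landed.
partitions = {"p1"}
sensors = {
    "wait_p1": lambda: "p1" in partitions,
    "wait_p2": lambda: "p2" in partitions,
}
finished = []
still_waiting = run_batch_sensor(sensors, finished.append, max_rounds=3)
```

One process serves both sensors here. The smart-sensor variant would differ mainly in where `pending` comes from (a query against the task-instance table) and in running indefinitely as a service.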
