airflow-dev mailing list archives

From Elser Rosa Leiva <elserrosale...@gmail.com>
Subject Re: [Discuss] Airflow sensor optimization
Date Thu, 07 Mar 2019 18:35:16 GMT
On 2019/03/06 14:31:57, Yingbo Wang <y...@gmail.com> wrote:
> Hi,
>
> I would like to open an AIP for Airflow sensor optimization.
>
>
> *Motivation*:
>
> Low efficiency in Airflow Sensor Implementation
>
> Sensors are a special kind of operator that will keep running until a
> certain criterion is met. Examples include a specific file landing in HDFS
> or S3, a partition appearing in Hive, or a specific time of the day.
> Sensors are derived from BaseSensorOperator and run a poke method at a
> specified poke_interval until it returns True.
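
A minimal sketch of the poke loop described above, in plain Python rather than the actual Airflow code. The function name `run_sensor` and the `timeout` cutoff are assumptions added for illustration; only `poke` and `poke_interval` come from the quoted text.

```python
import time
from typing import Callable


def run_sensor(poke: Callable[[], bool], poke_interval: float, timeout: float) -> bool:
    """Call poke() every poke_interval seconds until it returns True.

    Returns False if `timeout` seconds pass first (the timeout is an
    assumption of this sketch, not something stated in the proposal).
    """
    deadline = time.monotonic() + timeout
    while True:
        if poke():
            return True  # criterion met, the sensor task is done
        if time.monotonic() >= deadline:
            return False  # gave up waiting
        # The worker process stays occupied while it sleeps, which is the
        # resource cost the proposal wants to reduce.
        time.sleep(poke_interval)
```

Note that the loop holds its process for the whole wait, so one process is tied up per sensor.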
>
> Sensor tasks are inefficient because, in the current design, we spawn a
> separate worker process for each partition sensor. This worker might last a
> long time, until the target partition is available. When many sensor tasks
> need to run within certain time limits, we have to allocate a lot of
> resources to have enough workers for the sensor tasks.
>
> *Idea:*
>
> We propose two approaches that could address this issue: batch-sensor
> and smart-sensor.
>
>
>
> Batch-sensor
>
> The basic idea of batch-sensor is to batch process sensor tasks to save
> resources. While running, a batch-sensor takes N partition sensor
> requests as input and pokes those N partitions periodically. When the
> batch-sensor finds that the criterion of a sensor task is met, it
> updates the database record for that sensor task.
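
The batch-sensor idea could be sketched roughly as follows: one process pokes N requests per round and reports each one that is satisfied. This is only an illustration. `run_batch_sensor`, `mark_success`, and `max_rounds` are hypothetical names, and the callback stands in for the database update mentioned above.

```python
import time
from typing import Callable, Dict, Set


def run_batch_sensor(
    requests: Dict[str, Callable[[], bool]],
    mark_success: Callable[[str], None],
    poke_interval: float,
    max_rounds: int,
) -> Set[str]:
    """Poke every pending request once per round in a single process.

    `requests` maps a task key to its poke callable. `mark_success` is a
    hypothetical stand-in for updating the task's state in the database.
    Returns the keys that are still pending after `max_rounds` rounds.
    """
    pending = dict(requests)
    for _ in range(max_rounds):
        for key in list(pending):
            if pending[key]():
                mark_success(key)  # record that this sensor task is met
                del pending[key]
        if not pending:
            break
        time.sleep(poke_interval)
    return set(pending)
```

Compared with one process per sensor, the N pokes here share a single sleep loop, which is where the resource saving would come from.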
>
>
> To do this, we need to create a sensor base class called ‘batchable’ and
> make all sensors inherit from this base class. We also need to change the
> behavior of the scheduler regarding batchable sensor tasks. The scheduler
> will find as many batchable sensor tasks as possible and run those tasks
> in a batch.
>
>
> Smart-sensor
>
> Smart-sensor is an improvement on top of batch-sensor.
>
> The idea of smart-sensor is that the worker process of smart-sensor will
> run like a service. To do this, we need to persist sensor details in the
> Airflow DB. The worker process periodically queries the task-instance
> table to find sensor tasks, pokes the metastore, and updates the
> task-instance table when it detects that a certain partition or file has
> been created.
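
One pass of such a service could look roughly like the sketch below. It uses an in-memory SQLite database as a stand-in for the Airflow DB. The `sensor_task` table, its columns, the `sensing` and `success` state names, and both function names are assumptions made for this example, not the real Airflow schema.

```python
import sqlite3
from typing import Callable, Iterable, Tuple


def create_demo_db(tasks: Iterable[Tuple[str, str]]) -> sqlite3.Connection:
    """Build an in-memory table of persisted sensor details (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sensor_task (task_id TEXT PRIMARY KEY, target TEXT, state TEXT)"
    )
    conn.executemany(
        "INSERT INTO sensor_task VALUES (?, ?, 'sensing')", list(tasks)
    )
    conn.commit()
    return conn


def smart_sensor_pass(conn: sqlite3.Connection, poke: Callable[[str], bool]) -> int:
    """Query pending sensor tasks, poke each target, and update the met ones.

    Returns the number of tasks marked as successful in this pass. A
    long-running service would call this repeatedly with a sleep in between.
    """
    rows = conn.execute(
        "SELECT task_id, target FROM sensor_task WHERE state = 'sensing'"
    ).fetchall()
    done = 0
    for task_id, target in rows:
        if poke(target):  # e.g. check the metastore for the partition
            conn.execute(
                "UPDATE sensor_task SET state = 'success' WHERE task_id = ?",
                (task_id,),
            )
            done += 1
    conn.commit()
    return done
```

Because the sensor details live in the database, the service can pick up newly registered sensor tasks on the next pass without being restarted.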
>
