airflow-dev mailing list archives

From Yingbo Wang <ybw...@gmail.com>
Subject Re: [Discuss] Airflow sensor optimization
Date Wed, 06 Mar 2019 19:30:56 GMT
Wow, great work from Seelmann! Thanks, Fokko, for letting us know about it.
We are super happy to have this feature.

On Wed, Mar 6, 2019 at 11:24 AM Driesprong, Fokko <fokko@driesprong.frl>
wrote:

> Thanks for bringing this up. I've added a comment on the Wiki:
>
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-17+Airflow+sensor+optimization
>
> Have you looked into the work by Seelmann? Recently he introduced the
> ability to reschedule sensors. When rescheduling, the slot will be given
> back to the scheduler after a poke operation. Therefore the slot won't be
> occupied all the time. The details are in the PR
> https://github.com/apache/airflow/pull/3596
>
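
To make the slot argument above concrete, here is a rough back-of-the-envelope model of how long one sensor holds a worker slot. It is only a sketch and not Airflow code: the function name `slot_seconds`, its parameters, and the two mode labels are assumptions made for illustration, and scheduling overhead is ignored.

```python
import math


def slot_seconds(wait_time: int, poke_interval: int, poke_duration: int, mode: str) -> int:
    """Idealized worker-slot seconds used by one sensor.

    wait_time is how long (in seconds) it takes for the condition to become true.
    Pokes are assumed to start at t = 0, poke_interval, 2 * poke_interval, ...
    and the first poke at or after wait_time succeeds.
    """
    pokes = math.ceil(wait_time / poke_interval) + 1
    busy = pokes * poke_duration  # time actually spent poking
    if mode == "reschedule":
        # The slot is handed back between pokes, so only poke time counts.
        return busy
    if mode == "poke":
        # The slot is also held while sleeping between pokes.
        return busy + (pokes - 1) * poke_interval
    raise ValueError("mode must be 'poke' or 'reschedule'")


# One-hour wait, poking every 60 seconds, each poke taking about 1 second.
held_while_sleeping = slot_seconds(3600, 60, 1, "poke")
held_when_rescheduled = slot_seconds(3600, 60, 1, "reschedule")
```

Under these assumed numbers the sleeping sensor holds its slot for roughly the whole hour, while the rescheduled one holds it for about a minute in total.
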
> I would propose to make this the default behavior in Airflow 2.0.
>
> Cheers, Fokko
>
> On Wed, 6 Mar 2019 at 15:32, Yingbo Wang <ybwang@gmail.com> wrote:
>
> > Hi,
> >
> > I would like to open an AIP for Airflow sensor optimization.
> >
> >
> > *Motivation*:
> >
> > Low efficiency in Airflow Sensor Implementation
> >
> > Sensors are a special kind of operator that will keep running until a
> > certain criterion is met. Examples include a specific file landing in HDFS
> > or S3, a partition appearing in Hive, or a specific time of the day.
> > Sensors are derived from BaseSensorOperator and run a poke method at a
> > specified poke_interval until it returns True.
> >
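
The poke loop described above can be sketched in a few lines of plain Python. This is a simplified illustration rather than the BaseSensorOperator implementation: the function name `run_sensor`, the injectable `sleep` argument, and the timeout handling are assumptions made for the example.

```python
import time
from typing import Callable


def run_sensor(
    poke: Callable[[], bool],
    poke_interval: float,
    timeout: float,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call poke() every poke_interval seconds until it returns True.

    Returns the number of pokes made. Raises TimeoutError once the time
    spent sleeping between pokes has reached `timeout`.
    """
    waited = 0.0
    pokes = 0
    while True:
        pokes += 1
        if poke():
            return pokes
        if waited >= timeout:
            raise TimeoutError("sensor condition not met within timeout")
        # The worker process stays alive (and occupied) during this sleep.
        sleep(poke_interval)
        waited += poke_interval
```

The `sleep` call is where the worker sits idle while still holding its process, which is the cost the rest of this proposal tries to reduce.
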
> > Sensor tasks are inefficient because, in the current design, we spawn a
> > separate worker process for each partition sensor. This worker might last
> > a long time, until the target partition is available. When many sensor
> > tasks need to run within certain time limits, we have to allocate a lot
> > of resources to have enough workers for the sensor tasks.
> >
> > *Idea:*
> >
> > We propose two approaches that could address this issue: batch-sensor
> > and smart-sensor.
> >
> >
> >
> > Batch-sensor
> >
> > The basic idea of batch-sensor is to batch process sensor tasks to save
> > resources. While running, a batch-sensor will take N partition sensor
> > requests as input and poke those N partitions periodically. If the
> > batch-sensor finds that the criterion of a sensor task is met, it will
> > update the database entry for that sensor task.
> >
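
A minimal sketch of the batch-sensor idea above, assuming each sensor request can be reduced to a zero-argument check. The names `run_batch_sensor`, `mark_success`, and `max_rounds` are hypothetical, and the database update is stood in for by a callback.

```python
from typing import Callable, Dict, Set


def run_batch_sensor(
    requests: Dict[str, Callable[[], bool]],
    max_rounds: int,
    mark_success: Callable[[str], None],
) -> Set[str]:
    """Poke every pending request once per round.

    requests maps a task id to its poke function. When a poke returns True,
    mark_success(task_id) is called (standing in for the database update)
    and the request is dropped. Returns the ids that are still unmet.
    """
    pending = dict(requests)
    for _ in range(max_rounds):
        if not pending:
            break
        for task_id in list(pending):
            if pending[task_id]():
                mark_success(task_id)
                del pending[task_id]
        # A real batch-sensor would sleep for its poke interval here.
    return set(pending)
```

One worker process serves all N requests here, instead of N processes that each sleep between pokes.
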
> >
> > To do this, we need to create a sensor base class called ‘batchable’ and
> > make all sensors inherit from this base class. We also need to change the
> > behavior of the scheduler regarding batchable sensor tasks. The scheduler
> > will find as many batchable sensor tasks as possible and run those tasks
> > in a batch.
> >
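
One way the base class and the scheduler-side grouping described above might look, as a sketch only. The class names `Batchable` and `PartitionSensor`, the helper `make_batches`, and the fixed `batch_size` are all assumptions for illustration and do not correspond to existing Airflow classes.

```python
from typing import Iterable, List


class Batchable:
    """Hypothetical marker base class for sensors that can be poked in a batch."""

    def poke(self) -> bool:
        raise NotImplementedError


class PartitionSensor(Batchable):
    """Toy sensor that checks whether a partition name is in a known set."""

    def __init__(self, partition: str, available: set) -> None:
        self.partition = partition
        self.available = available

    def poke(self) -> bool:
        return self.partition in self.available


def make_batches(tasks: Iterable[object], batch_size: int) -> List[List[Batchable]]:
    """Pick out the batchable tasks and split them into batches of batch_size."""
    batchable = [t for t in tasks if isinstance(t, Batchable)]
    return [batchable[i:i + batch_size] for i in range(0, len(batchable), batch_size)]
```

Tasks that do not inherit from the base class are simply left out of the batches and would be scheduled as they are today.
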
> >
> > Smart-sensor
> >
> > Smart-sensor is an improvement on top of batch-sensor.
> >
> > The idea of smart-sensor is that the worker process of smart-sensor will
> > run like a service. To do this, we need to persist sensor details in the
> > Airflow DB. The worker process periodically queries the task-instance
> > table to find sensor tasks, pokes the metastore, and updates the
> > task-instance table if it detects that a certain partition or file has
> > been created.
> >
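
A rough sketch of one pass of the smart-sensor service loop described above, using an in-memory SQLite database in place of the Airflow DB. The table layout (`task_id`, `target`, `state`), the `'sensing'` state name, and the function names are assumptions for illustration; they are not the real Airflow schema.

```python
import sqlite3
from typing import Callable


def make_demo_db() -> sqlite3.Connection:
    """Create a toy task_instance table holding two persisted sensor tasks."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE task_instance (task_id TEXT PRIMARY KEY, target TEXT, state TEXT)"
    )
    conn.executemany(
        "INSERT INTO task_instance VALUES (?, ?, 'sensing')",
        [("wait_a", "db.a/ds=2019-03-06"), ("wait_b", "db.b/ds=2019-03-06")],
    )
    conn.commit()
    return conn


def smart_sensor_pass(conn: sqlite3.Connection, poke: Callable[[str], bool]) -> int:
    """Query pending sensor tasks, poke each target, and mark the met ones.

    Returns the number of task instances updated in this pass.
    """
    rows = conn.execute(
        "SELECT task_id, target FROM task_instance WHERE state = 'sensing'"
    ).fetchall()
    updated = 0
    for task_id, target in rows:
        if poke(target):  # e.g. ask the metastore whether the partition exists
            conn.execute(
                "UPDATE task_instance SET state = 'success' WHERE task_id = ?",
                (task_id,),
            )
            updated += 1
    conn.commit()
    return updated
```

A long-running service would call `smart_sensor_pass` in a loop with a sleep between passes, so new sensor tasks are picked up from the table without starting a new worker process for each one.
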
>
