airflow-dev mailing list archives

From Kevin Yang <yrql...@gmail.com>
Subject Re: [Discuss] Airflow sensor optimization
Date Fri, 08 Mar 2019 01:09:42 GMT
Thanks, Yingbo, for starting this, and thanks to everyone for joining the
discussion. Great point about sharding; this would be really useful for
large-scale clusters.

I imagine that in the first stage we can reuse the existing logic and make
the smart sensor a special kind of operator (maybe even make the scheduler
treat it differently, e.g. a more aggressive zombie check). Later we can
expand it to take over all sensor traffic and become an independent
component. It can be the place where we put together all our custom sensor
control logic, e.g. dedup, and solve the resource usage issue on both
dimensions without adding any more load to the scheduler and DB.

Looking forward to reviewing your PR!

Cheers,
Kevin Y

On Thu, Mar 7, 2019 at 3:50 PM Elser Rosa Leiva <elserrosaleiva@gmail.com>
wrote:

> On 2019/03/06 14:31:57, Yingbo Wang <y...@gmail.com> wrote:
> > Hi,
> >
> > I would like to open an AIP for Airflow sensor optimization.
> >
> >
> > *Motivation*:
> >
> > Low efficiency in Airflow Sensor Implementation
> >
> > Sensors are a special kind of operator that will keep running until a
> > certain criterion is met. Examples include a specific file landing in
> > HDFS or S3, a partition appearing in Hive, or a specific time of the
> > day. Sensors are derived from BaseSensorOperator and run a poke method
> > at a specified poke_interval until it returns True.
> >
> > Sensor tasks are inefficient because, in the current design, we spawn
> > a separate worker process for each partition sensor. This worker might
> > last a long time, until the target partition is available. When many
> > sensor tasks need to run within certain time limits, we have to
> > allocate a lot of resources to have enough workers for the sensor
> > tasks.
> >
> > *Idea:*
> >
> > We propose two approaches that could address this issue: batch-sensor
> > and smart-sensor.
> >
> >
> >
> > Batch-sensor
> >
> > The basic idea of batch-sensor is to process sensor tasks in batches to
> > save resources. While running, a batch-sensor takes N partition sensor
> > requests as input and pokes those N partitions periodically. When the
> > batch-sensor finds that the criterion of a sensor task is met, it
> > updates the database for that sensor task.
> >
> >
> > To do this, we need to create a sensor base class called ‘batchable’
> > and make all sensors inherit from this base class. We also need to
> > change the behavior of the scheduler regarding batchable sensor tasks:
> > the scheduler will find as many batchable sensor tasks as possible and
> > run those tasks in a batch.
> >
> >
> > Smart-sensor
> >
> > Smart-sensor is an improvement on top of batch-sensor.
> >
> > The idea of smart-sensor is that its worker process runs like a
> > service. To do this, we need to persist sensor details in the Airflow
> > DB. The worker process periodically queries the task-instance table to
> > find sensor tasks, pokes the metastore, and updates the task-instance
> > table when it detects that a partition or file has been created.
> >
>
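
For illustration only, here is a minimal sketch of the batch-sensor idea
quoted above: one worker loops over N pending poke callables instead of
holding one worker process per sensor. Everything in it is an assumption made
for the example (run_batch, ready_after, the dict of callables, the
max_rounds cap); none of it is Airflow API or the proposed implementation.

```python
import time
from typing import Callable, Dict, Set

# Hypothetical type: a poke is any zero-argument callable returning True
# once its criterion is met (mirrors the poke contract described above).
PokeFn = Callable[[], bool]


def run_batch(pokes: Dict[str, PokeFn],
              poke_interval: float = 0.0,
              max_rounds: int = 10) -> Set[str]:
    """Poke every pending sensor once per round, in a single process.

    Returns the keys of the sensors whose criterion was met within
    max_rounds. A real batch-sensor would record each success in the
    database instead of collecting it in a set.
    """
    pending = dict(pokes)
    done: Set[str] = set()
    for _ in range(max_rounds):
        for key in list(pending):
            if pending[key]():
                done.add(key)
                del pending[key]
        if not pending:
            break
        time.sleep(poke_interval)
    return done


def ready_after(n: int) -> PokeFn:
    """Build a fake poke that returns True from its n-th call onwards."""
    calls = {"count": 0}

    def poke() -> bool:
        calls["count"] += 1
        return calls["count"] >= n

    return poke
```

With three fake sensors that become ready on the 1st, 3rd, and 100th poke,
run_batch(..., max_rounds=5) finishes the first two and leaves the third
pending, all within one process.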
