airflow-dev mailing list archives

From Yingbo Wang <ybw...@gmail.com>
Subject Re: [Discuss] Airflow sensor optimization
Date Thu, 07 Mar 2019 17:40:36 GMT
There are two dimensions along which to evaluate how many resources sensors
consume in Airflow: the number of sensors and the duration of each sensor
task. The batch/smart sensor idea addresses the first, and rescheduling
addresses the second. For an Airflow cluster running a large number of
sensor tasks, the batch/smart sensor uses less than 10% of the sensor
resources that regular sensors use.
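
To make the batching idea concrete, here is a minimal sketch in plain
Python. It is not the AIP-17 implementation and imports nothing from
Airflow: BatchSensor, SensorRequest, and the dict standing in for the
Airflow DB are hypothetical names chosen for illustration. It only shows
one worker poking N requests per round instead of N workers poking one each.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SensorRequest:
    """One sensor task's request (hypothetical shape): an id plus a poke callable."""
    task_id: str
    poke: Callable[[], bool]


@dataclass
class BatchSensor:
    """Pokes many sensor requests from a single worker process (illustrative only)."""
    requests: List[SensorRequest]
    # Stand-in for the Airflow DB: task_id -> state.
    state: Dict[str, str] = field(default_factory=dict)

    def poke_once(self) -> List[str]:
        """Poke every pending request once; return the ids that succeeded this round."""
        done = []
        for req in self.requests:
            if self.state.get(req.task_id) == "success":
                continue  # already satisfied in an earlier round
            if req.poke():
                self.state[req.task_id] = "success"
                done.append(req.task_id)
        return done

    def pending(self) -> List[str]:
        """Ids of requests whose criteria have not been met yet."""
        return [r.task_id for r in self.requests
                if self.state.get(r.task_id) != "success"]
```

A real worker would call poke_once() in a loop, sleeping for the poke
interval between rounds, until pending() is empty or a timeout is reached.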

On Thu, Mar 7, 2019 at 2:36 AM Ash Berlin-Taylor <ash@apache.org> wrote:

> Rescheduling is of massive use for a DAG where we are waiting for a weekly
> S3 file delivery from a third party supplier with _massive_ variance in the
> delivery time. It'll appear at some point between Thursday AM and Sunday
> evening. Not having an executor slot tied up with the S3KeySensor is great
> for this.
>
> -ash
>
> > On 6 Mar 2019, at 21:51, Alex Guziel <alex.guziel@airbnb.com.INVALID> wrote:
> >
> > Smart sensor seems like a good idea, but I wonder how much performance
> > will be improved in practice. And of course, one must think about
> > sharding and such.
> >
> > I'm not sure how helpful rescheduling sensors is, since it will seemingly
> > add scheduler and DB load, which is already a bottleneck.
> >
> > On Wed, Mar 6, 2019 at 12:43 PM Yingbo Wang <ybwang@gmail.com> wrote:
> >
> >> I would still like to get some feedback on the batch sensor/smart sensor
> >> idea after viewing the sensor rescheduling PR. The reschedule mode does
> >> not reduce the number of worker processes for sensors; the batch sensor
> >> idea is proposed for this purpose and should work well with reschedule
> >> mode.
> >>
> >> On Wed, Mar 6, 2019 at 11:30 AM Yingbo Wang <ybwang@gmail.com> wrote:
> >>
> >>> Wow, great work from Seelmann! Thanks, Fokko, for letting us know about
> >>> it. We are super happy to have this feature.
> >>>
> >>> On Wed, Mar 6, 2019 at 11:24 AM Driesprong, Fokko <fokko@driesprong.frl> wrote:
> >>>
> >>>> Thanks for bringing this up. I've added a comment on the Wiki:
> >>>>
> >>>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-17+Airflow+sensor+optimization
> >>>>
> >>>> Have you looked into the work by Seelmann? Recently he introduced the
> >>>> ability to reschedule sensors. When rescheduling, the slot will be
> >>>> given back to the scheduler after a poke operation. Therefore the slot
> >>>> won't be occupied all the time. The details are in the PR
> >>>> https://github.com/apache/airflow/pull/3596
> >>>>
> >>>> I would propose to make this the default behavior in Airflow 2.0.
> >>>>
> >>>> Cheers, Fokko
> >>>>
> >>>> On Wed, 6 Mar 2019 at 15:32, Yingbo Wang <ybwang@gmail.com> wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> I would like to open an AIP for Airflow sensor optimization.
> >>>>>
> >>>>>
> >>>>> *Motivation*:
> >>>>>
> >>>>> Low efficiency in Airflow Sensor Implementation
> >>>>>
> >>>>> Sensors are a special kind of operator that will keep running until a
> >>>>> certain criterion is met. Examples include a specific file landing in
> >>>>> HDFS or S3, a partition appearing in Hive, or a specific time of the
> >>>>> day. Sensors are derived from BaseSensorOperator and run a poke method
> >>>>> at a specified poke_interval until it returns True.
> >>>>>
> >>>>> Sensor tasks are inefficient because, in the current design, we spawn
> >>>>> a separate worker process for each partition sensor. This worker might
> >>>>> last a long time, until the target partition is available. In the case
> >>>>> where there are many sensor tasks that need to run within certain time
> >>>>> limits, we have to allocate a lot of resources to have enough workers
> >>>>> for the sensor tasks.
> >>>>>
> >>>>> *Idea:*
> >>>>>
> >>>>> We propose two approaches that could address this issue: batch-sensor
> >>>>> and smart-sensor.
> >>>>>
> >>>>>
> >>>>>
> >>>>> Batch-sensor
> >>>>>
> >>>>> The basic idea of batch-sensor is to batch process sensor tasks to
> >>>>> save resources. While running, a batch-sensor takes N partition sensor
> >>>>> requests as input and pokes those N partitions periodically. If the
> >>>>> batch-sensor finds that the criteria of some sensor task are met, it
> >>>>> updates the database for that sensor task.
> >>>>>
> >>>>>
> >>>>> To do this, we need to create a sensor base class called ‘batchable’
> >>>>> and make all sensors inherit from this base class. We also need to
> >>>>> change the behavior of the scheduler regarding batchable sensor tasks.
> >>>>> The scheduler will find as many batchable sensor tasks as possible and
> >>>>> run those tasks in a batch.
> >>>>>
> >>>>>
> >>>>> Smart-sensor
> >>>>>
> >>>>> Smart-sensor is an improvement on top of batch-sensor.
> >>>>>
> >>>>> The idea of smart-sensor is that the worker process of smart-sensor
> >>>>> will run like a service. To do this, we need to persist sensor details
> >>>>> in the Airflow DB. The worker process then periodically queries the
> >>>>> task-instance table to find sensor tasks, pokes the metastore, and
> >>>>> updates the task-instance table if it detects that a certain partition
> >>>>> or file has been created.
> >>>>>
> >>>>
> >>>
> >>
>
>
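
The smart-sensor service loop described in the original proposal could be
sketched roughly as below. This is an assumption-laden illustration, not
Airflow code: the task_instance table here is a simplified in-memory SQLite
stand-in with made-up columns, the 'sensing' state name is hypothetical, and
the pokers mapping replaces real metastore or S3 checks.

```python
import sqlite3
from typing import Callable, Dict


def make_db() -> sqlite3.Connection:
    """Create an in-memory stand-in for the task-instance table (simplified schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE task_instance ("
        "task_id TEXT PRIMARY KEY, sensor_type TEXT, target TEXT, state TEXT)"
    )
    return conn


def smart_sensor_pass(conn: sqlite3.Connection,
                      pokers: Dict[str, Callable[[str], bool]]) -> int:
    """Run one pass of the service: find waiting sensor rows, poke each, mark successes.

    Returns the number of sensor tasks marked successful in this pass.
    """
    rows = conn.execute(
        "SELECT task_id, sensor_type, target FROM task_instance "
        "WHERE state = 'sensing'"
    ).fetchall()
    done = 0
    for task_id, sensor_type, target in rows:
        # Look up the poke function for this kind of sensor (hypothetical registry).
        if pokers[sensor_type](target):
            conn.execute(
                "UPDATE task_instance SET state = 'success' WHERE task_id = ?",
                (task_id,),
            )
            done += 1
    conn.commit()
    return done
```

A long-running worker would call smart_sensor_pass() repeatedly with a sleep
between passes. Sharding the rows across several such workers, as raised
earlier in the thread, is not covered by this sketch.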
