airflow-dev mailing list archives

From Ash Berlin-Taylor <...@apache.org>
Subject Re: [Discuss] Airflow sensor optimization
Date Thu, 07 Mar 2019 10:36:11 GMT
Rescheduling is of massive use for a DAG where we are waiting for a weekly S3 file delivery
from a third party supplier with _massive_ variance in the delivery time. It'll appear at
some point between Thursday AM and Sunday evening. Not having an executor slot tied up with
the S3KeySensor is great for this.

-ash
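
The trade-off discussed in this thread (executor slot time saved by rescheduling versus the extra scheduler and DB work it causes) can be made concrete with a rough cost model. This is only a sketch under stated assumptions, not Airflow code: `sensor_cost` and `poke_duration` are hypothetical names, pokes are assumed to happen at fixed intervals starting at time zero, and each reschedule is counted as one scheduling event.

```python
import math


def sensor_cost(mode, wait_seconds, poke_interval, poke_duration=1.0):
    """Return (slot_seconds, scheduling_events) for one sensor task.

    Hypothetical cost model: pokes happen at t = 0, poke_interval,
    2 * poke_interval, ... and the first poke at or after wait_seconds
    succeeds. Each poke is assumed to take poke_duration seconds.
    """
    if mode not in ("poke", "reschedule"):
        raise ValueError("mode must be 'poke' or 'reschedule'")

    # Number of pokes needed, including the final successful one.
    pokes = math.ceil(wait_seconds / poke_interval) + 1

    if mode == "poke":
        # The slot is held from the first poke until the successful poke
        # finishes, and the task is scheduled only once.
        slot_seconds = (pokes - 1) * poke_interval + poke_duration
        return slot_seconds, 1

    # In reschedule mode the slot is held only while a poke runs, but
    # every poke requires the task to be scheduled again.
    return pokes * poke_duration, pokes
```

Under these assumptions, a three-day wait polled every 10 minutes gives `sensor_cost("poke", 259200, 600) == (259201.0, 1)` and `sensor_cost("reschedule", 259200, 600) == (433.0, 433)`: far less slot time, at the price of several hundred extra scheduling events.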

> On 6 Mar 2019, at 21:51, Alex Guziel <alex.guziel@airbnb.com.INVALID> wrote:
> 
> Smart sensor seems like a good idea, but I wonder how much performance will
> be improved in practice. And of course, one must think about sharding and
> such.
> 
> I'm not sure how helpful rescheduling sensors is, since it seemingly adds
> scheduler and DB load, and those are already a bottleneck.
> 
> On Wed, Mar 6, 2019 at 12:43 PM Yingbo Wang <ybwang@gmail.com> wrote:
> 
>> I would still like to get some feedback on the batch sensor/smart sensor
>> idea after viewing the sensor rescheduling PR, since reschedule mode
>> does not reduce the number of worker processes for sensors. The batch sensor
>> idea is proposed for this purpose and should work well with reschedule
>> mode.
>> 
>> On Wed, Mar 6, 2019 at 11:30 AM Yingbo Wang <ybwang@gmail.com> wrote:
>> 
>>> Wow, great work from Seelmann! Thanks, Fokko, for letting us know about it.
>>> We are super happy to have this feature.
>>> 
>>> On Wed, Mar 6, 2019 at 11:24 AM Driesprong, Fokko <fokko@driesprong.frl>
>>> wrote:
>>> 
>>>> Thanks for bringing this up. I've added a comment on the Wiki:
>>>> 
>>>> 
>>>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-17+Airflow+sensor+optimization
>>>> 
>>>> Have you looked into the work by Seelmann? Recently he introduced the
>>>> ability to reschedule sensors. When rescheduling, the slot will be given
>>>> back to the scheduler after a poke operation. Therefore the slot won't be
>>>> occupied all the time. The details are in the PR
>>>> https://github.com/apache/airflow/pull/3596
>>>> 
>>>> I would propose to make this the default behavior in Airflow 2.0.
>>>> 
>>>> Cheers, Fokko
>>>> 
>>>> On Wed, 6 Mar 2019 at 15:32, Yingbo Wang <ybwang@gmail.com> wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> I would like to open an AIP for Airflow sensor optimization.
>>>>> 
>>>>> 
>>>>> *Motivation*:
>>>>> 
>>>>> Low efficiency in Airflow Sensor Implementation
>>>>> 
>>>>> Sensors are a special kind of operator that will keep running until a
>>>>> certain criterion is met. Examples include a specific file landing in HDFS
>>>>> or S3, a partition appearing in Hive, or a specific time of the day.
>>>>> Sensors are derived from BaseSensorOperator and run a poke method at a
>>>>> specified poke_interval until it returns True.
>>>>> 
>>>>> Sensor tasks are inefficient because, in the current design, we spawn a
>>>>> separate worker process for each partition sensor. This worker might last
>>>>> a long time, until the target partition is available. In the case where
>>>>> there are many sensor tasks that need to run within certain time limits,
>>>>> we have to allocate a lot of resources to have enough workers for the
>>>>> sensor tasks.
>>>>> 
>>>>> *Idea:*
>>>>> 
>>>>> We propose two approaches that could address this issue: batch-sensor
>>>>> and smart-sensor.
>>>>> 
>>>>> 
>>>>> 
>>>>> Batch-sensor
>>>>> 
>>>>> The basic idea of batch-sensor is to batch process sensor tasks to save
>>>>> resources. While running, a batch-sensor will take N partition sensor
>>>>> requests as input and poke those N partitions periodically. If the
>>>>> batch-sensor finds that the criterion of some sensor task is met, it
>>>>> will update the database for that sensor task.
>>>>> 
>>>>> 
>>>>> To do this, we need to create a sensor base class called ‘batchable’ and
>>>>> make all sensors inherit from this base class. We also need to change the
>>>>> behavior of the scheduler regarding batchable sensor tasks. The scheduler
>>>>> will find as many batchable sensor tasks as possible and run those tasks
>>>>> in a batch.
>>>>> 
>>>>> 
>>>>> Smart-sensor
>>>>> 
>>>>> Smart-sensor is an improvement on top of batch-sensor.
>>>>> 
>>>>> The idea of smart-sensor is that the worker process of smart-sensor will
>>>>> run like a service. To do this, we need to persist sensor details in the
>>>>> Airflow DB. The worker process periodically queries the task-instance
>>>>> table to find sensor tasks, pokes the metastore, and updates the
>>>>> task-instance table if it detects that a certain partition or file has
>>>>> been created.
>>>>> 
>>>> 
>>> 
>> 
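
The batch-sensor idea quoted above (one worker poking N sensor requests in a loop instead of one worker per sensor) can be sketched in plain Python. This is an illustration only, not the AIP-17 design or any Airflow API: `run_batch_sensor`, `max_rounds`, and the `done` dictionary standing in for the database update are all hypothetical.

```python
import time


def run_batch_sensor(pokes, poke_interval=0.0, max_rounds=100, sleep=time.sleep):
    """Poke many sensor requests from a single process.

    pokes: dict mapping a task id to a zero-argument callable that
    returns True once that sensor's criterion is met.
    Returns (done, pending): done maps each satisfied task id to the
    round in which it succeeded; pending lists the ids still waiting.
    """
    pending = dict(pokes)
    done = {}  # stands in for updating the task's state in the database

    for round_no in range(1, max_rounds + 1):
        # Iterate over a copy of the keys so satisfied tasks can be removed.
        for task_id in list(pending):
            if pending[task_id]():
                done[task_id] = round_no
                del pending[task_id]
        if not pending:
            break
        # One shared wait for the whole batch instead of one per sensor.
        sleep(poke_interval)

    return done, sorted(pending)
```

A single loop like this replaces N long-lived worker processes with one, which is the resource saving the proposal is after; how the scheduler would group sensors into batches is left open here.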

