airflow-dev mailing list archives

From Gerard Toonstra <gtoons...@gmail.com>
Subject Re: Qs on airflow
Date Mon, 25 Sep 2017 06:00:52 GMT
Hi Larry,

The important thing to ask is what kind of interface that other system has.
Your setup is a little unusual in the sense that the DAG processes across
multiple days. The potential issue I foresee is that you don't mention a
consistent start date for the DAG and you expect to run it in an ad-hoc
manner. Most DAGs process "windows" of activity, so you may run into issues
with the time always resetting to the defined scheduled start of the DAG.

What most DAGs do to handle this is to use sensor tasks. A Hadoop job, for
example, executes asynchronously from the originating request. You'd have one
task kick off the job and save the job id, then in a separate sensor task
fetch the job id through XCom and keep polling to verify whether the job has
finished (either failed or succeeded). Based on that, you let the DAG
continue or fail.
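
As a rough sketch of that pattern (not a drop-in implementation: it assumes
Airflow 1.8-style imports and a hypothetical job client `my_job_client` with
submit()/status() calls, which you'd replace with whatever your system
actually exposes):

    from airflow.operators.sensors import BaseSensorOperator
    from airflow.utils.decorators import apply_defaults

    def submit_job(**context):
        # python_callable for a PythonOperator; the returned job id is
        # automatically pushed to XCom under the 'return_value' key
        return my_job_client.submit()          # hypothetical async submission

    class JobStatusSensor(BaseSensorOperator):
        @apply_defaults
        def __init__(self, submit_task_id, *args, **kwargs):
            super(JobStatusSensor, self).__init__(*args, **kwargs)
            self.submit_task_id = submit_task_id

        def poke(self, context):
            job_id = context['ti'].xcom_pull(task_ids=self.submit_task_id)
            status = my_job_client.status(job_id)  # hypothetical status lookup
            if status == 'failed':
                raise ValueError('job %s failed' % job_id)  # fails the sensor
            return status == 'finished'                     # True ends the sensor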


So this is where the job interface question comes into play. It depends on
what you have available to verify the status of jobs; you'd probably write
some operators (or sensors) around that job interface. If these jobs never
take longer than a week, you could define a weekly interval, so a single run
never crosses those boundaries.
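
For example, a minimal sketch (dag_id and dates are only illustrative), where
a fixed start_date plus a weekly schedule keeps every dagrun inside a single
week-long window:

    from datetime import datetime
    from airflow import DAG

    dag = DAG(
        dag_id='three_stage_workloads',      # illustrative name
        start_date=datetime(2017, 9, 24),    # consistent, fixed start date
        schedule_interval='@weekly',         # one run per week-long window
    )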

Then look into, for example, the LatestOnlyOperator for how you can get the
left-most execution date, i.e. the datetime at which the dagrun was started.
There are other ways to get the exact start datetime of the task you're
interested in (when the job was started); from that you can work out the
total processing time you need, and then run that sensor every hour, through
a retry or similar.
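
A hedged sketch of that idea, assuming the dag_run object available in the
sensor's poke context exposes its start_date (worth verifying on your Airflow
version), and reusing the `dag` from the sketch above:

    from datetime import datetime, timedelta
    from airflow.operators.sensors import BaseSensorOperator

    class ElapsedTimeSensor(BaseSensorOperator):
        def poke(self, context):
            started_at = context['dag_run'].start_date  # when the dagrun started
            return datetime.utcnow() - started_at >= timedelta(days=3)

    wait_three_days = ElapsedTimeSensor(
        task_id='wait_three_days',
        poke_interval=60 * 60,        # poll once an hour
        timeout=4 * 24 * 60 * 60,     # give up after 4 days
        dag=dag,
    )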


An alternative is to look at what these tasks produce. For example, if you
drop files into S3 at the end of a process, look for those artifacts as a way
to tell whether the task succeeded or failed. Or, perhaps even easier, have
each workload write control files that you can check for in Airflow, which
can be simpler than implementing a full job control interface.
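
Sketched with the stock S3KeySensor (bucket and key names are placeholders;
the control file is whatever your workload writes on completion, '_SUCCESS'
here), again reusing the `dag` from above:

    from airflow.operators.sensors import S3KeySensor

    wait_for_stage1_done = S3KeySensor(
        task_id='wait_for_stage1_done',
        bucket_name='my-workload-bucket',   # placeholder bucket
        bucket_key='stage1/_SUCCESS',       # control file written by the job
        poke_interval=10 * 60,              # check every ten minutes
        timeout=3 * 24 * 60 * 60,           # stop waiting after 3 days
        dag=dag,
    )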


You could also start the DAG and rely on the 'retry' functionality in
Airflow: work out what retry interval and how many retries you need to reach
3 days in total, after which that task fails.
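
A sketch of that retry-based variant; `check_stage1` is a hypothetical
callable that raises if the job isn't done yet, and retries/retry_delay are
sized so the attempts span roughly 3 days before the task is finally marked
failed:

    from datetime import timedelta
    from airflow.operators.python_operator import PythonOperator

    check_stage1_done = PythonOperator(
        task_id='check_stage1_done',
        python_callable=check_stage1,    # hypothetical check, raises if not ready
        retries=72,                      # ~72 hourly retries: about 3 days total
        retry_delay=timedelta(hours=1),
        dag=dag,                         # dag from the sketch above
    )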


Rgds,

Gerard


On Mon, Sep 25, 2017 at 3:41 AM, Wang, Larry <Larry.L.Wang@dell.com> wrote:

> Any updates on this?
>
>
>
> We basically want to build the following DAG, and the group of BBTs in the
> rectangle (the ones starting with "snap") should be triggered on a daily basis.
>
> *From:* Wang, Larry
> *Sent:* Sunday, September 24, 2017 11:23 PM
> *To:* 'dev@airflow.apache.org' <dev@airflow.apache.org>
> *Subject:* Qs on airflow
>
>
>
> Hi experts,
>
>
>
> I am new to Airflow and want to ask some questions to see if it is
> possible to leverage this tool in our daily work; please see them
> below.
>
>
>
> 1st, I am implementing a system with 3 levels of workloads: the 1st
> workload is triggered at day 1, the 2nd workload is triggered at day 3
> only if the 1st job has managed to run for 3 days, and the last
> workload is triggered at day 5 if both previous workloads are still
> running. Can this be mapped to DAGs in Airflow?
>
>
>
> 2nd, given that the 1st workload warms up and keeps consuming certain
> system resources, a bunch of operations would be kicked off in a queue;
> is that possible?
>
>
>
> Thanks and best regards
>
> Larry  Wang
>
>
>
