airflow-commits mailing list archives

From "Julie Chien (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (AIRFLOW-3211) Airflow losing track of running GCP Dataproc jobs upon Airflow restart
Date Wed, 17 Oct 2018 18:27:00 GMT

     [ https://issues.apache.org/jira/browse/AIRFLOW-3211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julie Chien updated AIRFLOW-3211:
---------------------------------
    Description: 
If Airflow restarts (say, due to deployments, system updates, or regular machine restarts
such as the weekly restarts in GCP App Engine) while it's running a job on GCP Dataproc, it'll
lose track of that job, mark the task as failed, and eventually retry. However, the job may
still be running on Dataproc, and may even finish successfully. When Airflow then retries
and resubmits the job, the same job runs twice. This can result in delayed workflows,
increased costs, and duplicate data.
  
 To reproduce:

Setup:
 # Install Airflow.
 # Set up a GCP Project with the Dataproc API enabled
 # In the box that's running Airflow, {{pip install google-api-python-client oauth2client}}
 # Install this DAG in the Airflow instance: https://github.com/GoogleCloudPlatform/python-docs-samples/blob/b80895ed88ba86fce223df27a48bf481007ca708/composer/workflows/quickstart.py and set
up the Airflow variables as instructed at the top of the file.
 # Start the Airflow scheduler and webserver if they're not running already. Kick off a run
of the above DAG through the Airflow UI. Wait for the cluster to spin up and the job to start
running on Dataproc.
 # While the job's running, kill the scheduler and webserver, and then start them back up.
 # Wait for Airflow to retry the task. Click on the cluster in Dataproc to observe that the
job will have been resubmitted, even though the first job is still running without error.
  
 At Etsy, we've customized the Dataproc operators to let the new Airflow task pick
up where the old one left off after an Airflow restart, and we've been happily using this solution
for the past 6 months. I'd like to submit a PR to merge this change upstream.
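The resume-on-restart approach described above can be sketched roughly as follows. This is a hypothetical illustration, not the actual Etsy patch: the function names ({{resumable_submit}}, {{should_attach}}) and the deterministic-jobId convention are assumptions; the Dataproc calls use the v1 REST API via google-api-python-client, as installed in the setup steps.

```python
# Hypothetical sketch: make Dataproc job submission resumable across
# Airflow restarts by submitting with a deterministic jobId and
# re-attaching to any existing job with that id on retry.
# NOT the actual Etsy implementation; names are illustrative.

# Dataproc v1 job states that mean the first submission is still in
# flight (or already succeeded) and should be reused, not resubmitted.
ATTACHABLE_STATES = {"PENDING", "SETUP_DONE", "RUNNING", "DONE"}


def should_attach(state):
    """True if an existing job with our id should be reused instead of
    resubmitted (ERROR/CANCELLED jobs fall through to a fresh submit)."""
    return state in ATTACHABLE_STATES


def resumable_submit(project_id, region, job_body, job_id):
    """Submit a Dataproc job under a deterministic jobId (e.g. derived
    from the Airflow dag_id/task_id/execution_date) so that a restarted
    task can find and re-attach to the original run."""
    # Deferred import: only needed when actually talking to GCP.
    from googleapiclient import discovery
    from googleapiclient.errors import HttpError

    dataproc = discovery.build("dataproc", "v1")
    jobs = dataproc.projects().regions().jobs()
    try:
        existing = jobs.get(projectId=project_id, region=region,
                            jobId=job_id).execute()
        if should_attach(existing["status"]["state"]):
            return existing  # re-attach: first submission is still alive
    except HttpError as err:
        if err.resp.status != 404:
            raise  # only "job not found" means we may submit fresh
    job_body.setdefault("reference", {})["jobId"] = job_id
    return jobs.submit(projectId=project_id, region=region,
                       body={"job": job_body}).execute()
```

The key idea is the deterministic jobId: when the retried task calls the GET first, it finds the still-running first submission and polls it to completion rather than submitting a duplicate.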
  

  was:
If Airflow restarts (say, due to deployments, system updates, or regular machine restarts
such as the weekly restarts in GCP App Engine) while it's running a job on GCP Dataproc, it'll
lose track of that job, mark the task as failed, and eventually retry. However, the jobs may
still be running on Dataproc and maybe even finish successfully. So when Airflow retries
and reruns the job, the same job will run twice. This can result in issues like delayed
workflows, increased costs, and duplicate data. 
  
 To reproduce:
 # Install Airflow and set up a GCP project that has Dataproc enabled. Create a bucket in
the GCP project.
 # Install this DAG in the Airflow instance: https://github.com/GoogleCloudPlatform/python-docs-samples/blob/b80895ed88ba86fce223df27a48bf481007ca708/composer/workflows/quickstart.py Set
up the Airflow variables as instructed in the comments at the top of the file.
 # Start the Airflow scheduler and webserver. Kick off a run of the above DAG through the
Airflow UI. Wait for the cluster to spin up and the job to start running on Dataproc.
 # Kill the scheduler and webserver, and then start them back up.
 # Wait for Airflow to retry the task. Click on the cluster in Dataproc to observe that the
job will have been resubmitted, even though the first job is still running without error.
  
 At Etsy, we've customized the Dataproc operators to allow for the new Airflow task to pick
up where the old one left off upon Airflow restarts, and have been happily using our solution
for the past 6 months. I'd like to submit a PR to merge this change upstream.
  


> Airflow losing track of running GCP Dataproc jobs upon Airflow restart
> ----------------------------------------------------------------------
>
>                 Key: AIRFLOW-3211
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-3211
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: gcp
>    Affects Versions: 1.9.0, 1.10.0
>            Reporter: Julie Chien
>            Assignee: Julie Chien
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 1.9.0, 1.10.0
>
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
