airflow-commits mailing list archives

From "Al Johri (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (AIRFLOW-247) EMR Hook, Operators, Sensor
Date Sun, 07 May 2017 23:57:04 GMT

    [ https://issues.apache.org/jira/browse/AIRFLOW-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15998681#comment-15998681 ]

Al Johri edited comment on AIRFLOW-247 at 5/7/17 11:56 PM:
-----------------------------------------------------------

I'm searching for documentation related to how Airflow works with EMR. I'm struggling to find
anything here: https://airflow.incubator.apache.org/integration.html#aws

My main question is, can Airflow create an EMR cluster and bring it back down like AWS Data
Pipeline?

Thanks!

EDIT: Found some information here: 

Spark, EMR:
- https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/example_dags/example_emr_job_flow_automatic_steps.py
- https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/example_dags/example_emr_job_flow_manual_steps.py
- (uses emr hooks, operators) https://docs.google.com/presentation/d/1NG1P86HRlX43qTVucCTOsFqIbCvYdOhq_np90VlbVRc/edit#slide=id.gd40eeee67_1_0
- (uses shell scripts to launch and terminate emr clusters) https://www.agari.com/automated-model-building-emr-spark-airflow/
- (uses a shell script to spark-submit on a local spark installation) https://blog.insightdatascience.com/scheduling-spark-jobs-with-airflow-4c66f3144660
- (installs spark on each airflow worker node and runs local spark jobs without use of spark-submit) https://medium.com/@calvertmg/airflow-integrating-with-apache-spark-50a7704dcebd
- (alternative mozilla implementation for emr spark job) https://github.com/mozilla/telemetry-airflow/blob/master/dags/operators/emr_spark_operator.py

EMR: 
- https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/hooks/emr_hook.py
- https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/operators/emr_create_job_flow_operator.py
- https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/operators/emr_add_steps_operator.py
- https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/operators/emr_terminate_job_flow_operator.py

Spark:
- https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/hooks/spark_submit_hook.py
- https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/operators/spark_submit_operator.py
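
For the create-and-tear-down question, here is a minimal sketch of how the EMR operators linked above might be chained, modelled on the manual-steps example DAG. It assumes the contrib modules expose EmrCreateJobFlowOperator, EmrAddStepsOperator and EmrTerminateJobFlowOperator with the arguments shown, and that a step sensor lives at airflow.contrib.sensors.emr_step_sensor (that path is not among the links above, so treat it as an assumption). The DAG id, cluster settings and S3 path are hypothetical, and signatures may differ between Airflow versions.

```python
from datetime import datetime


def spark_step(name, script_uri):
    """Return one EMR step that runs spark-submit through command-runner.jar."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_uri],
        },
    }


# Hypothetical cluster settings, using the key names of the EMR RunJobFlow API.
JOB_FLOW_OVERRIDES = {
    "Name": "airflow-emr-sketch",
    "ReleaseLabel": "emr-5.5.0",
    "Instances": {
        "MasterInstanceType": "m3.xlarge",
        "SlaveInstanceType": "m3.xlarge",
        "InstanceCount": 3,
        # Keep the cluster up after creation so steps can be added later;
        # the terminate task at the end of the DAG brings it down.
        "KeepJobFlowAliveWhenNoSteps": True,
        "TerminationProtected": False,
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}

# Hypothetical script location.
SPARK_STEPS = [spark_step("example_job", "s3://my-bucket/jobs/example_job.py")]

# The create task pushes the job flow id to XCom; later tasks pull it by template.
JOB_FLOW_ID = "{{ task_instance.xcom_pull('create_job_flow', key='return_value') }}"
STEP_ID = "{{ task_instance.xcom_pull('add_steps', key='return_value')[0] }}"


def build_dag():
    """Wire create -> add steps -> wait -> terminate. Requires Airflow with contrib."""
    # Imported lazily so the config above can be inspected without Airflow installed.
    from airflow import DAG
    from airflow.contrib.operators.emr_create_job_flow_operator import EmrCreateJobFlowOperator
    from airflow.contrib.operators.emr_add_steps_operator import EmrAddStepsOperator
    from airflow.contrib.operators.emr_terminate_job_flow_operator import EmrTerminateJobFlowOperator
    from airflow.contrib.sensors.emr_step_sensor import EmrStepSensor  # assumed path

    dag = DAG(
        "emr_create_run_terminate_sketch",
        default_args={"owner": "airflow", "start_date": datetime(2017, 5, 1)},
        schedule_interval=None,
    )
    create = EmrCreateJobFlowOperator(
        task_id="create_job_flow", job_flow_overrides=JOB_FLOW_OVERRIDES,
        aws_conn_id="aws_default", emr_conn_id="emr_default", dag=dag)
    add = EmrAddStepsOperator(
        task_id="add_steps", job_flow_id=JOB_FLOW_ID,
        aws_conn_id="aws_default", steps=SPARK_STEPS, dag=dag)
    watch = EmrStepSensor(
        task_id="watch_step", job_flow_id=JOB_FLOW_ID, step_id=STEP_ID,
        aws_conn_id="aws_default", dag=dag)
    terminate = EmrTerminateJobFlowOperator(
        task_id="remove_cluster", job_flow_id=JOB_FLOW_ID,
        aws_conn_id="aws_default", dag=dag)

    create.set_downstream(add)
    add.set_downstream(watch)
    watch.set_downstream(terminate)
    return dag
```

In an actual DAG file the result would need to be assigned at module level (for example `dag = build_dag()`) so the scheduler can discover it.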


> EMR Hook, Operators, Sensor
> ---------------------------
>
>                 Key: AIRFLOW-247
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-247
>             Project: Apache Airflow
>          Issue Type: New Feature
>            Reporter: Rob Froetscher
>            Assignee: Rob Froetscher
>            Priority: Minor
>
> Substory of https://issues.apache.org/jira/browse/AIRFLOW-115. It would be nice to have an EMR hook and operators.
> Hook to generally interact with EMR.
> Operators to:
> * set up and start a job flow
> * add steps to an existing job flow
> A sensor to:
> * monitor completion and status of EMR jobs
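
The sensor described in the issue amounts to polling a step's state until it reaches a terminal value. A rough sketch of that loop in plain Python, not the actual Airflow sensor: `wait_for_step` and its `get_state` callable are hypothetical names, and `get_state` could, for instance, wrap boto3's `describe_step` and return `response['Step']['Status']['State']`. The state names are the ones the EMR API reports for steps.

```python
import time

# Terminal step states as reported by the EMR DescribeStep API.
DONE_STATES = {"COMPLETED"}
FAILED_STATES = {"CANCELLED", "FAILED", "INTERRUPTED"}


def wait_for_step(get_state, poke_interval=60, timeout=3600, sleep=time.sleep):
    """Poll get_state() until the step completes, fails, or the timeout passes.

    get_state is a hypothetical zero-argument callable returning the step's
    current state string. sleep is injectable so the loop can be tested
    without actually waiting.
    """
    waited = 0
    while True:
        state = get_state()
        if state in DONE_STATES:
            return state
        if state in FAILED_STATES:
            raise RuntimeError("EMR step ended in state %s" % state)
        if waited >= timeout:
            raise RuntimeError("timed out waiting for EMR step (last state %s)" % state)
        # Still PENDING, RUNNING or CANCEL_PENDING: wait and poll again.
        sleep(poke_interval)
        waited += poke_interval
```

Raising on a failed or cancelled step is one possible design choice; it lets a downstream task, such as cluster teardown, decide separately whether it should still run.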



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
