airflow-dev mailing list archives

From Rob Froetscher <>
Subject Re: Launching Spark jobs on Amazon EMR
Date Mon, 12 Sep 2016 22:43:51 GMT
Hey Daniel,

We also run airflow on docker and use EMR.

I wrote a PR <> to add EMR support to Airflow. It has been merged but has
not yet been released. The idea is that you keep your EMR configuration as
connections in the database, use the operators to interact with your
cluster, and use the sensors to wait for any action to complete.

There are two good example DAGs in the PR.
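For a rough picture before the release lands, here is a minimal sketch of
the kind of Spark step definition those operators pass to EMR. The dict
shape follows the boto3 EMR add-steps API; the job name, bucket, and script
path below are placeholders, not anything from the PR:

```python
# Hypothetical Spark step of the kind an EMR add-steps operator would
# submit to an existing cluster. On EMR, command-runner.jar is the
# standard way to invoke spark-submit as a step.
SPARK_STEPS = [
    {
        "Name": "example_spark_job",  # placeholder job name
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                # placeholder script location
                "s3://example-bucket/jobs/my_job.py",
            ],
        },
    }
]
```

You would hand a list like this to the add-steps operator, then point a step
sensor at the returned step id to block the DAG until the job finishes.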

We are using this currently for several jobs. Happy to answer any questions
you have about how we use it.



On Mon, Sep 12, 2016 at 2:10 PM, Daniel Siegmann <> wrote:

> Does anyone have experience using Airflow to launch Spark jobs on an Amazon
> EMR cluster?
> I have an Airflow cluster - separate from my EMR cluster - built as docker
> containers. I want to have Airflow submit jobs to an existing EMR cluster
> (though in the future I want to have Airflow start and stop clusters).
> I could copy the Hadoop configs from EMR to each of the Airflow nodes, but
> that's a pain. It'll be even more of a pain when I want to have Airflow
> create and destroy clusters. So I'd rather not take this approach.
> The only alternative I can think of is to use SSH to execute the
> spark-submit command on the EMR master node. This is simple enough, except
> Airflow will need the identity file for SSH access. Just copying the
> identity file to the Airflow nodes is problematic because Airflow runs in
> docker and I don't want this file in my Git repo.
> Is there anyone with a similar setup that would care to share their
> solution?
> --
> Daniel Siegmann
> Senior Software Engineer
> *SecurityScorecard Inc.*
> 214 W 29th Street, 5th Floor
> New York, NY 10001
