airflow-dev mailing list archives

From Bolke de Bruin <>
Subject Re: Kerberos and Airflow
Date Sat, 28 Jul 2018 21:07:10 GMT
Here: <>

This is a working, rudimentary implementation that allows securing the connections (only with
the LocalExecutor at the moment):

* It enforces the use of “conn_id” instead of the mix that we have now
* A task, if it uses “conn_id”, has ‘auto-registered’ (which is a noop) its connections
* The scheduler reads the connection information and serializes it to JSON (which should
be a different format, protobuf preferably)
* The scheduler then sends this info to the executor
* The executor puts this in the environment of the task (the environment is most likely not
secure enough for us)
* The BaseHook reads out this environment variable and does not need to touch the database
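
The flow in the list above can be sketched roughly as follows. This is a minimal illustration, not the code from the linked implementation: the environment variable name `AIRFLOW_TASK_CONNECTIONS`, the helper functions and the `SuppliedConnectionHook` class are all hypothetical names chosen for the example, and plain dicts stand in for Airflow connection objects.

```python
import json
import os

# Hypothetical variable name; the thread does not say which one is used.
CONN_ENV_VAR = "AIRFLOW_TASK_CONNECTIONS"


def serialize_connections(connections):
    """Scheduler side: serialize {conn_id: connection info} to JSON."""
    return json.dumps(connections, sort_keys=True)


def build_task_env(payload, base_env=None):
    """Executor side: copy the environment and add the serialized payload."""
    env = dict(os.environ if base_env is None else base_env)
    env[CONN_ENV_VAR] = payload
    return env


class SuppliedConnectionHook:
    """Stand-in for a BaseHook that only uses supplied connections."""

    @classmethod
    def get_connection(cls, conn_id, environ=None):
        # Read the payload from the task environment, never from the database.
        environ = os.environ if environ is None else environ
        supplied = json.loads(environ.get(CONN_ENV_VAR, "{}"))
        if conn_id not in supplied:
            raise KeyError("connection %r was not supplied to this task" % conn_id)
        return supplied[conn_id]


if __name__ == "__main__":
    payload = serialize_connections({"http_default": {"host": "example.com"}})
    task_env = build_task_env(payload, base_env={})
    print(SuppliedConnectionHook.get_connection("http_default", environ=task_env))
```

A task asking for a conn_id that was not supplied fails right away instead of falling back to the connections table, which is the point of the exercise.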

The example_http_operator works; I haven't tested any others. To make it work I just adjusted
the hook and operator to use “conn_id” instead
of the non-standard http_conn_id.

Makes sense? 


* The BaseHook is adjusted to not connect to the database
> On 28 Jul 2018, at 17:50, Bolke de Bruin <> wrote:
> Well, I don’t think a hook (or task) should obtain it by itself. It should be supplied.
> At the moment you start executing the task you cannot trust it anymore (i.e. it is unmanaged
> / non-airflow code).
> So we could change the BaseHook to understand supplied credentials and populate
> a hash with “conn_ids”. Hooks normally call BaseHook.get_connection anyway, so
> it shouldn't be too hard and should in principle not require changes to the hooks
> themselves if they are well behaved.
> B.
>> On 28 Jul 2018, at 17:41, Dan Davydov <> wrote:
>> *So basically in the scheduler we parse the dag. Either from the manifest
>> (new) or from smart parsing (probably harder, maybe some auto register?) we
>> know what connections and keytabs are available dag wide or per task.*
>> This is the hard part that I was curious about: for dynamically created
>> DAGs, e.g. those generated by reading tasks in a MySQL database or a JSON
>> file, there isn't a great way to do this.
>> I 100% agree with deprecating the connections table (at least for the
>> secure option). The main work there is rewriting all hooks to take
>> credentials from arbitrary data sources by allowing a customized
>> CredentialsReader class. Although hooks are technically private, I think a
>> lot of companies depend on them so the PMC should probably discuss if this
>> is an Airflow 2.0 change or not.
>> On Fri, Jul 27, 2018 at 5:24 PM Bolke de Bruin <> wrote:
>>> Sure. In general I consider keytabs as a part of connection information.
>>> Connections should be secured by sending the connection information a task
>>> needs as part of the information the executor gets. A task should then not need
>>> access to the connection table in Airflow. Keytabs could then be sent as
>>> part of the connection information (base64 encoded) and set up by the
>>> executor (this key) to be read-only to the task it is launching.
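
As a rough sketch of the keytab part of that idea: the keytab travels base64 encoded and the executor writes it out as a file that only its owner can read. The function names are hypothetical, and the sketch assumes the executor runs as the same user as the task; handing the file to a different user would need an extra chown step that is left out here.

```python
import base64
import os
import stat
import tempfile


def encode_keytab(raw_bytes):
    """Sender side: base64 encode the raw keytab bytes for transport."""
    return base64.b64encode(raw_bytes).decode("ascii")


def write_keytab_for_task(encoded, directory=None):
    """Executor side: decode the keytab and store it owner read-only."""
    raw = base64.b64decode(encoded)
    # mkstemp creates the file readable and writable by the creating user only.
    fd, path = tempfile.mkstemp(suffix=".keytab", dir=directory)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(raw)
        # Drop the write bit so the file ends up with mode 0400.
        os.chmod(path, stat.S_IRUSR)
    except Exception:
        os.unlink(path)
        raise
    return path
```

This does not address the concern about other processes on the same host reading files; it only shows the encode, decode and permission steps.
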
>>> So basically in the scheduler we parse the dag. Either from the manifest
>>> (new) or from smart parsing (probably harder, maybe some auto register?) we
>>> know what connections and keytabs are available dag wide or per task.
>>> The credentials and connection information are then serialized into a
>>> protobuf message and sent to the executor as part of the “queue” action.
>>> The worker then deserializes the information and makes it securely
>>> available to the task (which is quite hard btw).
>>> On that last bit: making the info securely available might mean storing it in
>>> the Linux KEYRING (supported by python keyring). Keytabs will be tough to
>>> do properly because Java does not properly support KEYRING, only files, and
>>> these are hard to make secure (due to the possibility that a process will list
>>> all files in /tmp and get credentials through that). Maybe storing the
>>> keytab with a password and having the password in the KEYRING might work.
>>> Something to find out.
>>> B.
>>> Sent from my iPad
>>>> On 27 Jul 2018, at 22:04, Dan Davydov <> wrote:
>>>> I'm curious if you had any ideas on how to enable multi-tenancy
>>>> with respect to Kerberos in Airflow.
>>>>> On Fri, Jul 27, 2018 at 2:38 PM Bolke de Bruin <> wrote:
>>>>> Cool. The doc will need some refinement as it isn't entirely accurate. In
>>>>> addition we need to separate between Airflow as a client of kerberized
>>>>> services (this is what is talked about in the astronomer doc) vs
>>>>> kerberizing airflow itself, which the API supports.
>>>>> In general to access kerberized services (airflow as a client) one needs
>>>>> to start the ticket renewer with a valid keytab. For the hooks it isn't
>>>>> always required to change the hook to support it. Hadoop cli tools often
>>>>> just pick it up as their client config is set to do so. Then there is
>>>>> another class for HTTP-like services which are accessed by urllib under
>>>>> the hood; these typically use SPNEGO. These often need to be adjusted, as
>>>>> this requires some urllib config. Finally, there are protocols which use
>>>>> SASL with kerberos, like HDFS (not webhdfs, that uses SPNEGO). These
>>>>> require per-protocol implementations.
>>>>> Off the top of my head we support kerberos client side now with:
>>>>> * Spark
>>>>> * HDFS (snakebite python 2.7, cli and with the upcoming libhdfs
>>>>> implementation)
>>>>> * Hive (not metastore afaik)
>>>>> Three things to remember:
>>>>> * If a job (e.g. a Spark job) will finish later than the maximum ticket
>>>>> lifetime you probably need to provide a keytab to said application.
>>>>> Otherwise you will get failures after the expiry.
>>>>> * A keytab (used by the renewer) is a set of credentials (user and pass), so
>>>>> jobs are executed under the keytab in use at that moment
>>>>> * Securing keytabs in multi-tenant airflow is a challenge. This also goes
>>>>> for securing connections. This we need to fix at some point. The solution
>>>>> for now seems to be no multi tenancy.
>>>>> Kerberos seems harder than it is btw. Still, we are sometimes moving away
>>>>> from it to OAUTH2 based authentication. This gets us closer to cloud
>>>>> standards (but we are on prem)
>>>>> B.
>>>>> Sent from my iPhone
>>>>>> On 27 Jul 2018, at 17:41, Hitesh Shah <> wrote:
>>>>>> Hi Taylor
>>>>>> +1 on upstreaming this. It would be great if you can submit a pull request
>>>>>> to enhance the apache airflow docs.
>>>>>> thanks
>>>>>> Hitesh
>>>>>>> On Thu, Jul 26, 2018 at 2:32 PM Taylor Edmiston <> wrote:
>>>>>>> While we're on the topic, I'd love any feedback from Bolke or others who've
>>>>>>> used Kerberos with Airflow on this quick guide I put together yesterday.
>>>>>>> It's similar to what's in the Airflow docs but instead all on one page
>>>>>>> and slightly expanded.
>>>>>>> (or web version <>)
>>>>>>> One thing I'd like to add is a minimal example of how to Kerberize a hook.
>>>>>>> I'd be happy to upstream this as well if it's useful (maybe a Concepts >
>>>>>>> Additional Functionality > Kerberos page?)
>>>>>>> Best,
>>>>>>> Taylor
>>>>>>> *Taylor Edmiston*
>>>>>>> On Thu, Jul 26, 2018 at 5:18 PM, Driesprong, Fokko <> wrote:
>>>>>>>> Hi Ry,
>>>>>>>> You should ask Bolke de Bruin. He's really experienced with Kerberos and he
>>>>>>>> also did the implementation for Airflow. Besides that, he worked on
>>>>>>>> implementing Kerberos in Ambari. Just want to let you know.
>>>>>>>> Cheers, Fokko
>>>>>>>> On Thu, 26 Jul 2018 at 23:03, Ry Walker <> wrote:
>>>>>>>>> Hi everyone -
>>>>>>>>> We have several bigCo's who are considering using Airflow and looking into
>>>>>>>>> its support for Kerberos.
>>>>>>>>> We're going to work on a proof-of-concept next week, and will likely record a
>>>>>>>>> screencast on it.
>>>>>>>>> For now, we're looking for any anecdotal information from organizations who
>>>>>>>>> are using Kerberos with Airflow. If anyone would be willing to share their
>>>>>>>>> experiences here, or reply to me personally, it would be greatly
>>>>>>>>> appreciated!
>>>>>>>>> -Ry
>>>>>>>>> --
>>>>>>>>> *Ry Walker* | CEO, Astronomer
