airflow-dev mailing list archives

From Bolke de Bruin <bdbr...@gmail.com>
Subject Re: Kerberos and Airflow
Date Sat, 28 Jul 2018 21:07:10 GMT
Here:

https://github.com/bolkedebruin/airflow/tree/secure_connections

Is a rudimentary working implementation that allows securing the connections (only the
LocalExecutor at the moment)

* It enforces the use of “conn_id” instead of the mix that we have now
* A task that uses “conn_id” has its connections ‘auto-registered’ (which is a no-op)
* The scheduler reads the connection information and serializes it to JSON (which should
become a different format, preferably protobuf)
* The scheduler then sends this info to the executor
* The executor puts this in the environment of the task (an environment variable is most
likely not secure enough for us)
* The BaseHook reads this environment variable and does not need to touch the database
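
That flow can be sketched end to end. A minimal sketch, assuming hypothetical function and variable names (this is not the actual code on the branch): the scheduler serializes the connection to JSON, the executor drops it into the task's environment, and the hook-side lookup reads it back without touching the database.

```python
import json
import os

# Hypothetical prefix; the real implementation may use a different convention.
ENV_PREFIX = "AIRFLOW_CONN_JSON_"

def export_connection(conn_id, host, login, password, env):
    """Scheduler/executor side: serialize a connection and place it in the
    environment that the task process will inherit."""
    payload = {"conn_id": conn_id, "host": host, "login": login, "password": password}
    env[ENV_PREFIX + conn_id.upper()] = json.dumps(payload)

def get_connection(conn_id, env=os.environ):
    """Hook side: read the connection back; no database access needed."""
    return json.loads(env[ENV_PREFIX + conn_id.upper()])

# Example round trip using a plain dict in place of the real environment
env = {}
export_connection("http_default", "example.com", "user", "s3cret", env)
assert get_connection("http_default", env)["host"] == "example.com"
```

As noted above, an environment variable is most likely not secure enough, so this only illustrates the data flow, not the final transport.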

The example_http_operator works; I haven't tested any others. To make it work I just adjusted
the hook and operator to use “conn_id” instead
of the non-standard http_conn_id.

* The BaseHook is adjusted to not connect to the database

Makes sense?

B.
> On 28 Jul 2018, at 17:50, Bolke de Bruin <bdbruin@gmail.com> wrote:
> 
> Well, I don’t think a hook (or task) should obtain it by itself. It should be supplied.
> At the moment you start executing the task you cannot trust it anymore (ie. it is unmanaged
> / non-airflow code).
> 
> So we could change the BaseHook to understand supplied credentials and populate
> a hash keyed by “conn_id”. Hooks normally call BaseHook.get_connection anyway, so
> it shouldn’t be too hard and should in principle not require changes to the hooks
> themselves if they are well behaved.
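
A sketch of what that BaseHook change could look like (the class shape and method names here are illustrative, not the real Airflow code):

```python
class BaseHook:
    """Sketch of a BaseHook that understands supplied credentials."""

    # Hash keyed by conn_id, populated by the executor before the task runs.
    _supplied_connections = {}

    @classmethod
    def supply_connections(cls, connections):
        cls._supplied_connections = dict(connections)

    @classmethod
    def get_connection(cls, conn_id):
        # Well-behaved hooks already route through this call, so they
        # pick up supplied credentials without any changes of their own.
        try:
            return cls._supplied_connections[conn_id]
        except KeyError:
            raise LookupError(f"no supplied connection for {conn_id!r}") from None
```

In this sketch the database fallback is removed entirely; a transitional version might instead fall back to the legacy lookup when nothing was supplied.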
> 
> B.
> 
>> On 28 Jul 2018, at 17:41, Dan Davydov <ddavydov@twitter.com.INVALID> wrote:
>> 
>> *So basically in the scheduler we parse the dag. Either from the manifest
>> (new) or from smart parsing (probably harder, maybe some auto register?) we
>> know what connections and keytabs are available dag wide or per task.*
>> This is the hard part that I was curious about: for dynamically created
>> DAGs, e.g. those generated by reading tasks from a MySQL database or a JSON
>> file, there isn't a great way to do this.
>> 
>> I 100% agree with deprecating the connections table (at least for the
>> secure option). The main work there is rewriting all hooks to take
>> credentials from arbitrary data sources by allowing a customized
>> CredentialsReader class. Although hooks are technically private, I think a
>> lot of companies depend on them so the PMC should probably discuss if this
>> is an Airflow 2.0 change or not.
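
The CredentialsReader idea might look roughly like this. The interface is hypothetical (nothing with this name exists in Airflow), and the environment-variable naming is an assumption for illustration:

```python
import json
import os
from abc import ABC, abstractmethod

class CredentialsReader(ABC):
    """Hooks would ask a reader for credentials instead of querying the
    connections table directly."""

    @abstractmethod
    def get_credentials(self, conn_id: str) -> dict:
        ...

class EnvCredentialsReader(CredentialsReader):
    """One possible backend: JSON injected into the task's environment by
    the executor; others could read from a file, a vault, etc."""

    def get_credentials(self, conn_id):
        return json.loads(os.environ["TASK_CONN_" + conn_id.upper()])
```

The point of the abstraction is that hooks stay unchanged while companies swap in readers backed by their own secret stores.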
>> 
>> On Fri, Jul 27, 2018 at 5:24 PM Bolke de Bruin <bdbruin@gmail.com> wrote:
>> 
>>> Sure. In general I consider keytabs part of the connection information.
>>> Connections should be secured by sending the connection information a task
>>> needs as part of the information the executor gets. A task should then not need
>>> access to the connection table in Airflow. Keytabs could then be sent as
>>> part of the connection information (base64 encoded) and set up by the
>>> executor to be readable only by the task it is launching.
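
Shipping a keytab inline could be sketched like this (function names are made up for illustration): the scheduler base64-encodes the keytab into the payload, and the executor writes it back to a file only the task's user can read.

```python
import base64
import os

def pack_keytab(path):
    """Scheduler side: embed the keytab bytes in the connection payload."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")

def unpack_keytab(encoded, directory):
    """Executor side: materialize the keytab with mode 0600, created
    exclusively so another process cannot pre-place the file."""
    path = os.path.join(directory, "task.keytab")
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "wb") as f:
        f.write(base64.b64decode(encoded))
    return path
```

The directory would be a per-task private location (not a shared /tmp), for the reasons discussed further down in the thread.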
>>> 
>>> So basically in the scheduler we parse the dag. Either from the manifest
>>> (new) or from smart parsing (probably harder, maybe some auto register?) we
>>> know what connections and keytabs are available dag wide or per task.
>>> 
>>> The credentials and connection information are then serialized into a
>>> protobuf message and sent to the executor as part of the “queue” action.
>>> The worker then deserializes the information and makes it securely
>>> available to the task (which is quite hard btw).
>>> 
>>> On that last bit: making the info securely available might mean storing it in
>>> the Linux KEYRING (supported by python keyring). Keytabs will be tough to
>>> do properly because Java does not properly support the KEYRING, only files,
>>> and files are hard to make secure (a process could list all files in /tmp
>>> and harvest credentials that way). Maybe storing the keytab encrypted with
>>> a password and keeping the password in the KEYRING might work.
>>> Something to find out.
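
A sketch of how that password-wrapped keytab could start, under stated assumptions: PBKDF2 is a standard way to turn the KEYRING-held password into an encryption key, but the scheme as a whole is just the idea above, not an existing Airflow feature.

```python
import hashlib
import os

def derive_keytab_key(password: str, salt: bytes) -> bytes:
    """Derive the key that would encrypt the keytab on disk.

    The password itself would live in the kernel KEYRING (e.g. via the
    `keyring` package), so a process scanning /tmp only ever finds an
    encrypted blob. The actual encryption step would use an AEAD cipher
    from a library such as `cryptography`; this sketch stops at key
    derivation to stay stdlib-only.
    """
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 200_000)

# A fresh random salt per keytab, stored alongside the ciphertext.
salt = os.urandom(16)
```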
>>> 
>>> B.
>>> 
>>> Sent from my iPad
>>> 
>>>> On 27 Jul 2018, at 22:04, Dan Davydov <ddavydov@twitter.com.INVALID> wrote:
>>>> 
>>>> I'm curious whether you have any ideas on how to enable multi-tenancy
>>>> with respect to Kerberos in Airflow.
>>>> 
>>>>> On Fri, Jul 27, 2018 at 2:38 PM Bolke de Bruin <bdbruin@gmail.com> wrote:
>>>>> 
>>>>> Cool. The doc will need some refinement as it isn't entirely accurate.
>>>>> In addition, we need to distinguish between Airflow as a client of
>>>>> kerberized services (which is what the Astronomer doc covers) and
>>>>> kerberizing Airflow itself, which the API supports.
>>>>> 
>>>>> In general, to access kerberized services (Airflow as a client) one needs
>>>>> to start the ticket renewer with a valid keytab. It isn't always necessary
>>>>> to change a hook to support this: Hadoop CLI tools often just pick it up,
>>>>> as their client config is set to do so. Another class is HTTP-like services
>>>>> which are accessed by urllib under the hood; these typically use SPNEGO and
>>>>> often need to be adjusted, as SPNEGO requires some urllib config. Finally,
>>>>> there are protocols which use SASL with kerberos, like HDFS (not webhdfs,
>>>>> which uses SPNEGO). These require per-protocol implementations.
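
For the HTTP-like class, the per-hook adjustment often comes down to wiring in a SPNEGO auth object. A sketch using the requests ecosystem (`requests_kerberos` is a common choice for SPNEGO; the `use_kerberos` extra field is a hypothetical convention, and whether a given hook uses requests at all is an assumption here):

```python
def http_auth_for_extra(extra: dict):
    """Pick an auth object for an HTTP-style hook from connection extras.

    requests_kerberos is imported lazily so non-kerberized deployments
    don't need it installed.
    """
    if extra.get("use_kerberos"):
        from requests_kerberos import HTTPKerberosAuth, OPTIONAL  # third-party
        return HTTPKerberosAuth(mutual_authentication=OPTIONAL)
    return None
```

The returned object would be passed as `auth=` on the hook's requests session; with no flag set, behaviour is unchanged.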
>>>>> 
>>>>> Off the top of my head we support kerberos client side now with:
>>>>> 
>>>>> * Spark
>>>>> * HDFS (snakebite on Python 2.7, the CLI, and the upcoming libhdfs
>>>>> implementation)
>>>>> * Hive (not the metastore afaik)
>>>>> 
>>>>> A few things to remember:
>>>>> 
>>>>> * If a job (ie. a Spark job) will finish later than the maximum ticket
>>>>> lifetime, you probably need to provide a keytab to said application.
>>>>> Otherwise you will get failures after the expiry.
>>>>> * A keytab (used by the renewer) is a set of credentials (user and pass),
>>>>> so jobs are executed under the keytab in use at that moment.
>>>>> * Securing keytabs in a multi-tenant Airflow is a challenge. This also
>>>>> goes for securing connections. We need to fix this at some point; the
>>>>> solution for now seems to be no multi-tenancy.
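
The keytab-plus-renewer mechanics boil down to re-running kinit from the keytab before the ticket lifetime runs out. A sketch of the command a renewer would issue (the flags are standard MIT kinit; renewing at half the lifetime is a common convention, not something Airflow mandates):

```python
def kinit_command(principal, keytab_path, ccache_path):
    # -k -t: authenticate from the keytab instead of prompting for a password
    # -c: write the ticket to a credential cache private to this renewer
    return ["kinit", "-k", "-t", keytab_path, "-c", ccache_path, principal]

def renewal_interval(ticket_lifetime_seconds):
    # Re-run kinit at half the ticket lifetime so a slow renewal cycle
    # never leaves a running job holding an expired ticket.
    return ticket_lifetime_seconds // 2
```

A renewer daemon would run this command via subprocess on that interval; long-running jobs that outlive even renewed tickets need the keytab itself, per the first bullet above.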
>>>>> 
>>>>> Kerberos seems harder than it is, btw. Still, we are sometimes moving
>>>>> away from it to OAUTH2-based authentication. This gets us closer to cloud
>>>>> standards (but we are on prem).
>>>>> 
>>>>> B.
>>>>> 
>>>>> Sent from my iPhone
>>>>> 
>>>>>> On 27 Jul 2018, at 17:41, Hitesh Shah <hitesh@apache.org> wrote:
>>>>>> 
>>>>>> Hi Taylor
>>>>>> 
>>>>>> +1 on upstreaming this. It would be great if you could submit a pull
>>>>>> request to enhance the Apache Airflow docs.
>>>>>> 
>>>>>> thanks
>>>>>> Hitesh
>>>>>> 
>>>>>> 
>>>>>>> On Thu, Jul 26, 2018 at 2:32 PM Taylor Edmiston <tedmiston@gmail.com> wrote:
>>>>>>> 
>>>>>>> While we're on the topic, I'd love any feedback from Bolke or others
>>>>>>> who've used Kerberos with Airflow on this quick guide I put together
>>>>>>> yesterday. It's similar to what's in the Airflow docs but instead all
>>>>>>> on one page and slightly expanded.
>>>>>>> 
>>>>>>> 
>>>>>>> https://github.com/astronomerio/airflow-guides/blob/master/guides/kerberos.md
>>>>>>> (or web version <https://www.astronomer.io/guides/kerberos/>)
>>>>>>> 
>>>>>>> One thing I'd like to add is a minimal example of how to Kerberize a
>>>>>>> hook.
>>>>>>> 
>>>>>>> I'd be happy to upstream this as well if it's useful (maybe a
>>>>>>> Concepts > Additional Functionality > Kerberos page?)
>>>>>>> 
>>>>>>> Best,
>>>>>>> Taylor
>>>>>>> 
>>>>>>> 
>>>>>>> *Taylor Edmiston*
>>>>>>> Blog <https://blog.tedmiston.com/> | CV
>>>>>>> <https://stackoverflow.com/cv/taylor> | LinkedIn
>>>>>>> <https://www.linkedin.com/in/tedmiston/> | AngelList
>>>>>>> <https://angel.co/taylor> | Stack Overflow
>>>>>>> <https://stackoverflow.com/users/149428/taylor-edmiston>
>>>>>>> 
>>>>>>> 
>>>>>>> On Thu, Jul 26, 2018 at 5:18 PM, Driesprong, Fokko
>>>>>>> <fokko@driesprong.frl> wrote:
>>>>>>> 
>>>>>>>> Hi Ry,
>>>>>>>> 
>>>>>>>> You should ask Bolke de Bruin. He's really experienced with Kerberos,
>>>>>>>> and he also did the implementation for Airflow. Besides that, he also
>>>>>>>> worked on implementing Kerberos in Ambari. Just wanted to let you know.
>>>>>>>> 
>>>>>>>> Cheers, Fokko
>>>>>>>> 
>>>>>>>> On Thu 26 Jul 2018 at 23:03, Ry Walker <ry@astronomer.io> wrote:
>>>>>>>> 
>>>>>>>>> Hi everyone -
>>>>>>>>> 
>>>>>>>>> We have several bigCos who are considering using Airflow asking about
>>>>>>>>> its support for Kerberos.
>>>>>>>>> 
>>>>>>>>> We're going to work on a proof-of-concept next week, and will likely
>>>>>>>>> record a screencast on it.
>>>>>>>>> 
>>>>>>>>> For now, we're looking for any anecdotal information from organizations
>>>>>>>>> who are using Kerberos with Airflow. If anyone would be willing to share
>>>>>>>>> their experiences here, or reply to me personally, it would be greatly
>>>>>>>>> appreciated!
>>>>>>>>> 
>>>>>>>>> -Ry
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> 
>>>>>>>>> *Ry Walker* | CEO, Astronomer <http://www.astronomer.io/> | 513.417.2163 |
>>>>>>>>> @rywalker <http://twitter.com/rywalker> | LinkedIn
>>>>>>>>> <http://www.linkedin.com/in/rywalker>
>>>>>>> 
>>>>> 
>>> 
> 

