airflow-dev mailing list archives

From Dan Davydov <ddavy...@twitter.com.INVALID>
Subject Re: Kerberos and Airflow
Date Sun, 29 Jul 2018 03:36:36 GMT
This makes sense, and thanks for putting this together. I might pick this
up myself, depending on whether we can get the rest of the multi-tenancy story
nailed down. I still think the tricky part is figuring out how to allow
dynamic DAGs (e.g. DAGs created from rows in a MySQL table) to work with
Kerberos; I'm curious what your thoughts are there. How would secrets be passed
securely in a multi-tenant scheduler, from parsing the DAGs up to
the executor sending them off?

On Sat, Jul 28, 2018 at 5:07 PM Bolke de Bruin <bdbruin@gmail.com> wrote:

> Here:
>
> https://github.com/bolkedebruin/airflow/tree/secure_connections
>
> is a working, rudimentary implementation that allows securing the
> connections (only LocalExecutor at the moment).
>
> * It enforces the use of “conn_id” instead of the mix that we have now
> * A task that uses “conn_id” has ‘auto-registered’ (which is a no-op) its
> connections
> * The scheduler reads the connection information and serializes it to
> JSON (which should be a different format, preferably protobuf)
> * The scheduler then sends this info to the executor
> * The executor puts this in the environment of the task (the environment is
> most likely not secure enough for us)
> * The BaseHook reads out this environment variable and does not need to
> touch the database
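
A minimal sketch of the flow in the list above, for illustration only. It is not the code from the secure_connections branch: the variable prefix `AIRFLOW_SUPPLIED_CONN_` and the function names `serialize_connection`, `build_task_env` and `get_connection` are hypothetical, and a plain dict stands in for an Airflow connection.

```python
import json
import os

# Hypothetical prefix; the actual branch may name its variable differently.
ENV_PREFIX = "AIRFLOW_SUPPLIED_CONN_"


def serialize_connection(conn):
    """Scheduler side: turn the connection fields into a JSON string."""
    return json.dumps(conn, sort_keys=True)


def build_task_env(connections, base_env=None):
    """Executor side: copy the environment and add one variable per conn_id."""
    env = dict(os.environ if base_env is None else base_env)
    for conn_id, conn in connections.items():
        env[ENV_PREFIX + conn_id.upper()] = serialize_connection(conn)
    return env


def get_connection(conn_id, environ=None):
    """Hook side: resolve a conn_id from the environment, never the database."""
    environ = os.environ if environ is None else environ
    try:
        raw = environ[ENV_PREFIX + conn_id.upper()]
    except KeyError:
        raise LookupError(
            "connection %r was not supplied to this task" % conn_id
        ) from None
    return json.loads(raw)


# Example round trip with an empty base environment.
env = build_task_env(
    {"http_default": {"host": "example.com", "port": 443}}, base_env={}
)
conn = get_connection("http_default", environ=env)
```

The environment is used here only because the list above describes it; as noted there, it is probably not secure enough, since other processes of the same user can usually read it.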
>
> The example_http_operator works; I haven't tested any others. To make it
> work I just adjusted the hook and operator to use “conn_id” instead
> of the non-standard http_conn_id.
>
> Makes sense?
>
> B.
>
> * The BaseHook is adjusted to not connect to the database
> > On 28 Jul 2018, at 17:50, Bolke de Bruin <bdbruin@gmail.com> wrote:
> >
> > Well, I don’t think a hook (or task) should obtain it by itself. It
> > should be supplied.
> > At the moment you start executing the task you cannot trust it anymore
> > (i.e. it is unmanaged / non-Airflow code).
> >
> > So we could change the BaseHook to understand supplied credentials and
> > populate a hash with “conn_ids”. Hooks normally call
> > BaseHook.get_connection anyway, so it shouldn't be too hard and should in
> > principle not require changes to the hooks themselves if they are well
> > behaved.
> >
> > B.
> >
> >> On 28 Jul 2018, at 17:41, Dan Davydov <ddavydov@twitter.com.INVALID> wrote:
> >>
> >> *So basically in the scheduler we parse the dag. Either from the manifest
> >> (new) or from smart parsing (probably harder, maybe some auto register?) we
> >> know what connections and keytabs are available dag wide or per task.*
> >> This is the hard part that I was curious about: for dynamically created
> >> DAGs, e.g. those generated by reading tasks in a MySQL database or a JSON
> >> file, there isn't a great way to do this.
> >>
> >> I 100% agree with deprecating the connections table (at least for the
> >> secure option). The main work there is rewriting all hooks to take
> >> credentials from arbitrary data sources by allowing a customized
> >> CredentialsReader class. Although hooks are technically private, I think a
> >> lot of companies depend on them, so the PMC should probably discuss whether
> >> this is an Airflow 2.0 change or not.
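
One possible shape for such a CredentialsReader, as a sketch only. Nothing below exists in Airflow: the `read` method, `SuppliedCredentialsReader` and `ExampleHook` are hypothetical names, and a plain dict stands in for a connection.

```python
from abc import ABC, abstractmethod


class CredentialsReader(ABC):
    """Hypothetical interface: a source of credentials keyed by conn_id."""

    @abstractmethod
    def read(self, conn_id):
        """Return the connection fields for conn_id or raise LookupError."""


class SuppliedCredentialsReader(CredentialsReader):
    """Serves only the connections that were handed to the task up front."""

    def __init__(self, supplied):
        # Copy so later changes by the caller do not leak into the reader.
        self._supplied = dict(supplied)

    def read(self, conn_id):
        try:
            return self._supplied[conn_id]
        except KeyError:
            raise LookupError(
                "connection %r was not supplied" % conn_id
            ) from None


class ExampleHook:
    """Stand-in for a hook that asks a reader instead of the connections table."""

    def __init__(self, conn_id, reader):
        self.conn_id = conn_id
        self.reader = reader

    def get_connection(self):
        return self.reader.read(self.conn_id)


reader = SuppliedCredentialsReader({"hive_default": {"host": "hive.internal"}})
hook = ExampleHook("hive_default", reader)
```

With an interface like this, a hook that already goes through a single lookup call would not need to know whether the credentials came from the database, the executor, or somewhere else.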
> >>
> >> On Fri, Jul 27, 2018 at 5:24 PM Bolke de Bruin <bdbruin@gmail.com> wrote:
> >>
> >>> Sure. In general I consider keytabs as a part of connection information.
> >>> Connections should be secured by sending the connection information a task
> >>> needs as part of the information the executor gets. A task should then not
> >>> need access to the connection table in Airflow. Keytabs could then be sent
> >>> as part of the connection information (base64 encoded) and set up by the
> >>> executor (this key) to be read-only to the task it is launching.
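
A rough sketch of that last step, assuming a POSIX system. The function name `write_keytab` and the file name `task.keytab` are made up for illustration, and this only narrows access to the owning user; it does not address the harder isolation questions raised in this thread.

```python
import base64
import os
import stat
import tempfile


def write_keytab(encoded_keytab, directory=None):
    """Decode a base64 keytab and store it read-only for the current user."""
    keytab_bytes = base64.b64decode(encoded_keytab)
    # mkdtemp creates a directory that only the creating user can access.
    private_dir = tempfile.mkdtemp(prefix="keytab-", dir=directory)
    path = os.path.join(private_dir, "task.keytab")
    # O_EXCL refuses to reuse an existing file; 0o600 allows the write below.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "wb") as handle:
        handle.write(keytab_bytes)
    # Drop write permission once the content is in place.
    os.chmod(path, stat.S_IRUSR)
    return path


# Example with placeholder bytes instead of a real keytab.
encoded = base64.b64encode(b"not-a-real-keytab")
keytab_path = write_keytab(encoded)
```

The task would then be pointed at `keytab_path`, for example through the standard `KRB5_CLIENT_KTNAME` variable or whatever its Kerberos client expects.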
> >>>
> >>> So basically in the scheduler we parse the dag. Either from the manifest
> >>> (new) or from smart parsing (probably harder, maybe some auto register?) we
> >>> know what connections and keytabs are available dag wide or per task.
> >>>
> >>> The credentials and connection information are then serialized into a
> >>> protobuf message and sent to the executor as part of the “queue” action.
> >>> The worker then deserializes the information and makes it securely
> >>> available to the task (which is quite hard btw).
> >>>
> >>> On that last bit: making the info securely available might mean storing it
> >>> in the Linux KEYRING (supported by python keyring). Keytabs will be tough to
> >>> do properly, because Java does not properly support KEYRING, only files, and
> >>> these are hard to make secure (due to the possibility that a process will
> >>> list all files in /tmp and get credentials through that). Maybe storing the
> >>> keytab with a password and having the password in the KEYRING might work.
> >>> Something to find out.
> >>>
> >>> B.
> >>>
> >>> Sent from my iPad
> >>>
> >>>> On 27 Jul 2018, at 22:04, Dan Davydov <ddavydov@twitter.com.INVALID> wrote:
> >>>>
> >>>> I'm curious whether you have any ideas for enabling multi-tenancy
> >>>> with respect to Kerberos in Airflow.
> >>>>
> >>>>> On Fri, Jul 27, 2018 at 2:38 PM Bolke de Bruin <bdbruin@gmail.com> wrote:
> >>>>>
> >>>>> Cool. The doc will need some refinement as it isn't entirely accurate. In
> >>>>> addition we need to separate between Airflow as a client of kerberized
> >>>>> services (this is what is talked about in the astronomer doc) vs
> >>>>> kerberizing airflow itself, which the API supports.
> >>>>>
> >>>>> In general, to access kerberized services (airflow as a client) one needs
> >>>>> to start the ticket renewer with a valid keytab. For the hooks it isn't
> >>>>> always required to change the hook to support it. Hadoop CLI tools often
> >>>>> just pick it up, as their client config is set to do so. Then there is
> >>>>> another class: HTTP-like services which are accessed by urllib under the
> >>>>> hood. These typically use SPNEGO and often need to be adjusted, as it
> >>>>> requires some urllib config. Finally, there are protocols which use SASL
> >>>>> with Kerberos, like HDFS (not webhdfs, that uses SPNEGO). These require
> >>>>> per-protocol implementations.
> >>>>>
> >>>>> Off the top of my head, we support Kerberos client side now with:
> >>>>>
> >>>>> * Spark
> >>>>> * HDFS (snakebite python 2.7, cli and with the upcoming libhdfs
> >>>>> implementation)
> >>>>> * Hive (not metastore afaik)
> >>>>>
> >>>>> Three things to remember:
> >>>>>
> >>>>> * If a job (e.g. a Spark job) will finish later than the maximum ticket
> >>>>> lifetime, you probably need to provide a keytab to said application.
> >>>>> Otherwise you will get failures after the expiry.
> >>>>> * A keytab (used by the renewer) is a set of credentials (user and pass),
> >>>>> so jobs are executed under the keytab in use at that moment
> >>>>> * Securing keytabs in multi-tenant Airflow is a challenge. This also goes
> >>>>> for securing connections. This we need to fix at some point. The solution
> >>>>> for now seems to be no multi-tenancy.
> >>>>>
> >>>>> Kerberos seems harder than it is btw. Still, we are sometimes moving away
> >>>>> from it to OAuth2-based authentication. This gets us closer to cloud
> >>>>> standards (but we are on prem).
> >>>>>
> >>>>> B.
> >>>>>
> >>>>> Sent from my iPhone
> >>>>>
> >>>>>> On 27 Jul 2018, at 17:41, Hitesh Shah <hitesh@apache.org> wrote:
> >>>>>>
> >>>>>> Hi Taylor
> >>>>>>
> >>>>>> +1 on upstreaming this. It would be great if you can submit a pull
> >>>>>> request to enhance the Apache Airflow docs.
> >>>>>>
> >>>>>> thanks
> >>>>>> Hitesh
> >>>>>>
> >>>>>>
> >>>>>>> On Thu, Jul 26, 2018 at 2:32 PM Taylor Edmiston <tedmiston@gmail.com> wrote:
> >>>>>>>
> >>>>>>> While we're on the topic, I'd love any feedback from Bolke or others
> >>>>>>> who've used Kerberos with Airflow on this quick guide I put together
> >>>>>>> yesterday. It's similar to what's in the Airflow docs but instead all
> >>>>>>> on one page and slightly expanded.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> https://github.com/astronomerio/airflow-guides/blob/master/guides/kerberos.md
> >>>>>>> (or web version <https://www.astronomer.io/guides/kerberos/>)
> >>>>>>>
> >>>>>>> One thing I'd like to add is a minimal example of how to Kerberize a
> >>>>>>> hook.
> >>>>>>>
> >>>>>>> I'd be happy to upstream this as well if it's useful (maybe a
> >>>>>>> Concepts > Additional Functionality > Kerberos page?)
> >>>>>>>
> >>>>>>> Best,
> >>>>>>> Taylor
> >>>>>>>
> >>>>>>>
> >>>>>>> *Taylor Edmiston*
> >>>>>>> Blog <https://blog.tedmiston.com/> | CV
> >>>>>>> <https://stackoverflow.com/cv/taylor> | LinkedIn
> >>>>>>> <https://www.linkedin.com/in/tedmiston/> | AngelList
> >>>>>>> <https://angel.co/taylor> | Stack Overflow
> >>>>>>> <https://stackoverflow.com/users/149428/taylor-edmiston>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Thu, Jul 26, 2018 at 5:18 PM, Driesprong, Fokko <fokko@driesprong.frl> wrote:
> >>>>>>>
> >>>>>>>> Hi Ry,
> >>>>>>>>
> >>>>>>>> You should ask Bolke de Bruin. He's really experienced with Kerberos
> >>>>>>>> and he also did the implementation for Airflow. Besides that, he also
> >>>>>>>> worked on implementing Kerberos in Ambari. Just wanted to let you know.
> >>>>>>>>
> >>>>>>>> Cheers, Fokko
> >>>>>>>>
> >>>>>>>> On Thu, 26 Jul 2018 at 23:03, Ry Walker <ry@astronomer.io> wrote:
> >>>>>>>>
> >>>>>>>>> Hi everyone -
> >>>>>>>>>
> >>>>>>>>> We have several bigCos who are considering using Airflow and are
> >>>>>>>>> asking about its support for Kerberos.
> >>>>>>>>>
> >>>>>>>>> We're going to work on a proof-of-concept next week, and will likely
> >>>>>>>>> record a screencast on it.
> >>>>>>>>>
> >>>>>>>>> For now, we're looking for any anecdotal information from
> >>>>>>>>> organizations who are using Kerberos with Airflow. If anyone would be
> >>>>>>>>> willing to share their experiences here, or reply to me personally, it
> >>>>>>>>> would be greatly appreciated!
> >>>>>>>>>
> >>>>>>>>> -Ry
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>>
> >>>>>>>>> *Ry Walker* | CEO, Astronomer <http://www.astronomer.io/> |
> >>>>>>>>> 513.417.2163 | @rywalker <http://twitter.com/rywalker> | LinkedIn
> >>>>>>>>> <http://www.linkedin.com/in/rywalker>
