airflow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jasper Kahn <jak...@google.com.INVALID>
Subject Re: Initial Design for Supporting fine-grained Connection encryption [Feedback Requested]
Date Wed, 11 Jul 2018 22:06:43 GMT
Thanks for the responses, Ash and Shah. Some comments in-line below.

On 2018/07/07 18:13:09, Ash Berlin-Taylor <ash_airflowlist@firemirror.com>
wrote:
> A KMS is nothing strictly to do with Kubernetes - both Google Cloud and
AWS have a Key Management System. Having extra fields in the JSON relies on
extra being JSON which it might not be in every case.
>
> Comments in-line Jasper.
>
> The end goal of only allowing certain tasks to decrypt connections is a
good one, and sounds useful for a multi-team/multi-tenant Airflow. Even in
the case of not using containerised workers having KMS encryption could
make things a bit better (although provides no further security over the
current as any Airflow process would be able to impersonate another DAG
worker and get other secrets)

This was our opinion as well; the security is more useful in the
multi-team/tenant case than more "simple" Airflow setups.

>
> > On 7 Jul 2018, at 08:11, Shah Altaf <mendhak@gmail.com> wrote:
> >
> > Hi my feedback is - This feels far too k8s specific.  There would now be
> > extra fields in the connection form (and CLI) that are
> > hosting/implementation specific and not at all agnostic.  These could
> > probably go as additional params in the existing extra field's JSON.
That
> > would avoid any confusion about what those extra fields are and any
> > k8s-users that want that specific implementation have a place to put it.
> >
> >
> >
> >
> > On Fri, Jul 6, 2018 at 9:02 PM Jasper Kahn <jakahn@google.com.invalid>
> > wrote:
> >
> >> Hello,
> >>
> >> In support of adding fine-grained Connection encryption (Jira Issue:
> >> https://issues.apache.org/jira/browse/AIRFLOW-2062) I wanted to gather
> >> feedback on a proposed design, as it affects a few different Airflow
> >> components. A full design doc is coming next week.
> >>
> >> The end goal is to allow per-Connection encryption (as opposed the
global
> >> fernet key) to support providing containerized tasks with independent
> >> credentials to limit access, and to enable integration with Key
Management
> >> Systems.
> >>
> >>
> >> At a high level, Connection objects will be augmented with 2 additional
> >> fields: `KMS_type` and `KMS_extras`, which are modeled (somewhat)
after the
> >> existing `conn_type` and `extras` fields. Each connection can be
flagged as
> >> "independently encrypted", which then prompts the user to pick a KMS
(from
> >> a predefined list, like Connection type) and enter the relevant
> >> authentication and metadata that KMS requires to operate (mirroring how
> >> choosing a Connection type results in additional configuration).
>
> You don't mention where the KMS's will be defined for the user to pick
from? Will this be a connection itself?

My current design was going to mirror Connection types and Hooks by adding
a KMSClient directory (much like we have for hooks/operators now) and a
mapping in Connection like the one in get_hook (
https://github.com/apache/incubator-airflow/blob/master/airflow/models.py#L759).
This allows support for alternate KMSs, or new ones, in the future.

>
> Do we need totally new fields? Would a single extra flag/type field be
enough? In the case where kms_conn_id is set what would be in the current
extra_field?

extras would contain whatever it normally would if a KMS was not being
used. Some connections may carry other sensitive credentials in their
extras field, beyond what is stored in the default password field, which
should also be encrypted (see the example two sections down).

> Although unlikely might we want support for using two different KMS ids
concurrently? (therefore a `kms_type` isn't unique enough).

This is what the KMS_extras field provides; the KMS_type is just so Airflow
can select the appropriate client (AWS KMS, Google Cloud KMS, etc.). It's
just a simple string, much Connection Type is now (and unlike conn_id).

>
> Perhaps example values of the case where a connection is KMS encrypted,
and one where it isn't would help make this clearer.

Sure. Consider two connections: Connection H, a simple HTTP connection to
an unsecured resource (perhaps a public resource), and Connection D, a
connection to an internal database with appropriate credentials. D's
sensitive credentials may extend outside of the default password field
depending on what type of connection it is.

You may not care that H's details are visible to anyone with read access to
the Airflow configuration, but you would prefer D's details (including the
extras field) are kept encrypted, and decrypted only at time of use. To
minimize costs on your chosen KMS, you do not encrypt H, but you do encrypt
D.

>
> >>
> >> The credentials to authenticate with the KMS can either be manually
placed
> >> (like some key files for Connections are now) or, in the case of
> >> containerized workers, injected as a key file (through file system
mapping)
> >> or environment variable on a per-worker basis. These changes are
primarily
> >> in support of the second (containerized workers) model.
> >>
> >> When creating an encrypted Connection, Airflow will generate a
> >> cryptographic key (likely AES, possibly a separate fernet key) for that
> >> connection and encrypt the Connection fields. It will then encrypt
that key
> >> (K_conn) using the KMS.
>
> With the AWS KMS I think it's possible to let it entirely this process -
and you just give it the data to de/encrypt. Where it does handing off this
entire process should be achieved - the less crypto code we have to do
ourselves the better!

I agree, however AWS (for example) limits the payload-to-be-encrypted to
4Kb (
https://docs.aws.amazon.com/kms/latest/APIReference/API_Encrypt.html#API_Encrypt_RequestSyntax)
and the extras field alone can be 5Kb (
https://github.com/apache/incubator-airflow/blob/master/airflow/models.py#L640).
We could make multiple encryption passes (and decryption passes), although
the documentation for these systems indicate the intended use is to encrypt
the "true" encryption keys.

>
> >>
> >> KMS communication happens through KMSClients, which are implemented
very
> >> similarly to Connection types and Hooks, with a mapping from KMS_type
to a
> >> particular Client. New clients can be added by the community (as with
> >> hooks/Connection Types). The API for a KMSClient is simple: Encryption
and
> >> Decryption. The `encrypt` method would take in K_conn and the
configuration
> >> data, encrypt K_conn through the KMS, and return JSON to be stored in
the
> >> KMS_extra field. `decrypt` is passed this KMS_extra JSON, decrypts
K_conn
> >> though the KMS, and returns K_conn to be used to decrypt the Connection
> >> data. After both operations, K_conn is purged from memory.
>
> "configuration data" is encryption/decryption context that would be
passed to the AWS KMS APIs? (for example. I use the AWS one as that is the
one I'm most familiar with)

Yes, and any other Connection-specific data needed, such as which
keyring/key to use.

>
> >>
> >> Decryption would be implemented where the Connection is loaded from the
> >> database or environment. This makes the presence of per-Connection
> >> encryption transparent to any calling code, much like the fernet
encryption
> >> works now.
> >>
> >>
> >> As mentioned, all feedback and criticism is welcome to try to improve
this
> >> design. Thanks!
> >>
> >> Jasper Kahn
> >>
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message