From dev-return-5557-archive-asf-public=cust-asf.ponee.io@airflow.incubator.apache.org Thu Jul 12 00:07:00 2018 Return-Path: X-Original-To: archive-asf-public@cust-asf.ponee.io Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by mx-eu-01.ponee.io (Postfix) with SMTP id E5A0518062A for ; Thu, 12 Jul 2018 00:06:59 +0200 (CEST) Received: (qmail 78359 invoked by uid 500); 11 Jul 2018 22:06:58 -0000 Mailing-List: contact dev-help@airflow.incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@airflow.incubator.apache.org Delivered-To: mailing list dev@airflow.incubator.apache.org Received: (qmail 78345 invoked by uid 99); 11 Jul 2018 22:06:58 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd1-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 11 Jul 2018 22:06:58 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd1-us-west.apache.org (ASF Mail Server at spamd1-us-west.apache.org) with ESMTP id C2C23CA9BB for ; Wed, 11 Jul 2018 22:06:57 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd1-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: -13.631 X-Spam-Level: X-Spam-Status: No, score=-13.631 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, ENV_AND_HDR_SPF_MATCH=-0.5, HTML_MESSAGE=2, RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_MSPIKE_H3=-0.01, RCVD_IN_MSPIKE_WL=-0.01, SPF_PASS=-0.001, T_DKIMWL_WL_MED=-0.01, USER_IN_DEF_DKIM_WL=-7.5, USER_IN_DEF_SPF_WL=-7.5] autolearn=disabled Authentication-Results: spamd1-us-west.apache.org (amavisd-new); dkim=pass (2048-bit key) header.d=google.com Received: from mx1-lw-eu.apache.org ([10.40.0.8]) by localhost (spamd1-us-west.apache.org [10.40.0.7]) (amavisd-new, port 10024) with ESMTP id G9unS8ehMhBk for ; Wed, 11 Jul 2018 22:06:56 +0000 (UTC) Received: from mail-vk0-f68.google.com (mail-vk0-f68.google.com [209.85.213.68]) by mx1-lw-eu.apache.org (ASF Mail Server at mx1-lw-eu.apache.org) with ESMTPS id 7911F5F1B9 for ; Wed, 11 Jul 2018 22:06:55 +0000 (UTC) Received: by mail-vk0-f68.google.com with SMTP id b77-v6so15281654vkb.5 for ; Wed, 11 Jul 2018 15:06:55 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:from:date:message-id:subject:to; bh=8rWiYaeiLqvtUyZCPJDJG7KE3qD1EFr09N1HwuRGUkY=; b=KAI94D67BmgTOrGupCNd74xXYHEYys1wPkHBy8Biy9Igq8Ay/QsQfYdUUcsanckMYS FX3KXWjc3dG4B6HEsbDKliVMeFJ+rED6l6SdfbyzJUUVwdHj/ZdaakObHwoUG3y6HuOS Yt9pAWqGF8kT8Cm7avcpLIWVgK4Ghdvz0WgyRoi0P41meC1BfUZ6tjA8wZ8heQBA3gTv qdBR76ZjREC52gsV8PrWDY0VlnRaQIzvGBeC6PkPIYp4AYn5lSPLekrngZINAAwxfZxx Tzzp8xwvGBX+bsd71pBYMANJpnk9P5ylYWgjuXLGyMEjW6h31Cy/51Zt9m8kzMHvyI12 TUSg== X-Gm-Message-State: AOUpUlGkOqSEDbeI9H0IZgi2oj4qB/Lkt13GIN9zfv0UABR2RzjKyjQZ iSp3GNw7XA3IZpff+mBWpZYdBf9/q5F/TUgVyL/zX1vEig== X-Google-Smtp-Source: AAOMgpeQJCC2r628nGHP9Q+AETKMpg0GaY+g30orm6fRQTj8TyAOoLaytfjmdRWDDcc7wGyUSagBYJfvLwxd/edqUVM= X-Received: by 2002:a1f:88:: with SMTP id 130-v6mr279605vka.154.1531346813896; Wed, 11 Jul 2018 15:06:53 -0700 (PDT) MIME-Version: 1.0 From: Jasper Kahn Date: Wed, 11 Jul 2018 15:06:43 -0700 Message-ID: Subject: Re: Initial Design for Supporting fine-grained Connection encryption [Feedback Requested] To: dev@airflow.apache.org Content-Type: multipart/alternative; boundary="000000000000df58f10570c076cf" --000000000000df58f10570c076cf Content-Type: text/plain; charset="UTF-8" Thanks for the responses, Ash and Shah. Some comments in-line below. On 2018/07/07 18:13:09, Ash Berlin-Taylor wrote: > A KMS is nothing strictly to do with Kubernetes - both Google Cloud and AWS have a Key Management System. Having extra fields in the JSON relies on extra being JSON which it might not be in every case. > > Comments in-line Jasper. > > The end goal of only allowing certain tasks to decrypt connections is a good one, and sounds useful for a multi-team/multi-tenant Airflow. Even in the case of not using containerised workers having KMS encryption could make things a bit better (although provides no further security over the current as any Airflow process would be able to impersonate another DAG worker and get other secrets) This was our opinion as well; the security is more useful in the multi-team/tenant case than more "simple" Airflow setups. > > > On 7 Jul 2018, at 08:11, Shah Altaf wrote: > > > > Hi my feedback is - This feels far too k8s specific. There would now be > > extra fields in the connection form (and CLI) that are > > hosting/implementation specific and not at all agnostic. These could > > probably go as additional params in the existing extra field's JSON. That > > would avoid any confusion about what those extra fields are and any > > k8s-users that want that specific implementation have a place to put it. > > > > > > > > > > On Fri, Jul 6, 2018 at 9:02 PM Jasper Kahn > > wrote: > > > >> Hello, > >> > >> In support of adding fine-grained Connection encryption (Jira Issue: > >> https://issues.apache.org/jira/browse/AIRFLOW-2062) I wanted to gather > >> feedback on a proposed design, as it affects a few different Airflow > >> components. A full design doc is coming next week. > >> > >> The end goal is to allow per-Connection encryption (as opposed the global > >> fernet key) to support providing containerized tasks with independent > >> credentials to limit access, and to enable integration with Key Management > >> Systems. > >> > >> > >> At a high level, Connection objects will be augmented with 2 additional > >> fields: `KMS_type` and `KMS_extras`, which are modeled (somewhat) after the > >> existing `conn_type` and `extras` fields. Each connection can be flagged as > >> "independently encrypted", which then prompts the user to pick a KMS (from > >> a predefined list, like Connection type) and enter the relevant > >> authentication and metadata that KMS requires to operate (mirroring how > >> choosing a Connection type results in additional configuration). > > You don't mention where the KMS's will be defined for the user to pick from? Will this be a connection itself? My current design was going to mirror Connection types and Hooks by adding a KMSClient directory (much like we have for hooks/operators now) and a mapping in Connection like the one in get_hook ( https://github.com/apache/incubator-airflow/blob/master/airflow/models.py#L759). This allows support for alternate KMSs, or new ones, in the future. > > Do we need totally new fields? Would a single extra flag/type field be enough? In the case where kms_conn_id is set what would be in the current extra_field? extras would contain whatever it normally would if a KMS was not being used. Some connections may carry other sensitive credentials in their extras field, beyond what is stored in the default password field, which should also be encrypted (see the example two sections down). > Although unlikely might we want support for using two different KMS ids concurrently? (therefore a `kms_type` isn't unique enough). This is what the KMS_extras field provides; the KMS_type is just so Airflow can select the appropriate client (AWS KMS, Google Cloud KMS, etc.). It's just a simple string, much Connection Type is now (and unlike conn_id). > > Perhaps example values of the case where a connection is KMS encrypted, and one where it isn't would help make this clearer. Sure. Consider two connections: Connection H, a simple HTTP connection to an unsecured resource (perhaps a public resource), and Connection D, a connection to an internal database with appropriate credentials. D's sensitive credentials may extend outside of the default password field depending on what type of connection it is. You may not care that H's details are visible to anyone with read access to the Airflow configuration, but you would prefer D's details (including the extras field) are kept encrypted, and decrypted only at time of use. To minimize costs on your chosen KMS, you do not encrypt H, but you do encrypt D. > > >> > >> The credentials to authenticate with the KMS can either be manually placed > >> (like some key files for Connections are now) or, in the case of > >> containerized workers, injected as a key file (through file system mapping) > >> or environment variable on a per-worker basis. These changes are primarily > >> in support of the second (containerized workers) model. > >> > >> When creating an encrypted Connection, Airflow will generate a > >> cryptographic key (likely AES, possibly a separate fernet key) for that > >> connection and encrypt the Connection fields. It will then encrypt that key > >> (K_conn) using the KMS. > > With the AWS KMS I think it's possible to let it entirely this process - and you just give it the data to de/encrypt. Where it does handing off this entire process should be achieved - the less crypto code we have to do ourselves the better! I agree, however AWS (for example) limits the payload-to-be-encrypted to 4Kb ( https://docs.aws.amazon.com/kms/latest/APIReference/API_Encrypt.html#API_Encrypt_RequestSyntax) and the extras field alone can be 5Kb ( https://github.com/apache/incubator-airflow/blob/master/airflow/models.py#L640). We could make multiple encryption passes (and decryption passes), although the documentation for these systems indicate the intended use is to encrypt the "true" encryption keys. > > >> > >> KMS communication happens through KMSClients, which are implemented very > >> similarly to Connection types and Hooks, with a mapping from KMS_type to a > >> particular Client. New clients can be added by the community (as with > >> hooks/Connection Types). The API for a KMSClient is simple: Encryption and > >> Decryption. The `encrypt` method would take in K_conn and the configuration > >> data, encrypt K_conn through the KMS, and return JSON to be stored in the > >> KMS_extra field. `decrypt` is passed this KMS_extra JSON, decrypts K_conn > >> though the KMS, and returns K_conn to be used to decrypt the Connection > >> data. After both operations, K_conn is purged from memory. > > "configuration data" is encryption/decryption context that would be passed to the AWS KMS APIs? (for example. I use the AWS one as that is the one I'm most familiar with) Yes, and any other Connection-specific data needed, such as which keyring/key to use. > > >> > >> Decryption would be implemented where the Connection is loaded from the > >> database or environment. This makes the presence of per-Connection > >> encryption transparent to any calling code, much like the fernet encryption > >> works now. > >> > >> > >> As mentioned, all feedback and criticism is welcome to try to improve this > >> design. Thanks! > >> > >> Jasper Kahn > >> > > --000000000000df58f10570c076cf--