impala-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From tarmstr...@apache.org
Subject [2/4] impala git commit: IMPALA-6067: Enable S3 access via IAM roles for EC2 VMs
Date Sat, 09 Dec 2017 06:13:55 GMT
IMPALA-6067: Enable S3 access via IAM roles for EC2 VMs

For some time Impala in a production environment has been able
to access data stored in Amazon S3 buckets using credentials specified
in a number of ways:
- storing Amazon access keys in environment variables or
  in core-site.xml.
- using proprietary management tools to store Amazon access keys
  securely
- using Amazon IAM roles bound to VMs running in EC2.

The development minicluster environment used the first approach,
which risked leaking these keys.

This change enables Impala builds to use IAM
roles to access S3 buckets when running on an Amazon EC2 virtual
machine. The changes mainly ensure that environment variables carrying
the traditional AWS credentials do not conflict with credentials supplied
by the IAM role attached to the VM instance.

IAM role based credentials are accessible through the EC2
instance-property mechanism; for further details see Amazon's docs at
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html#instance-metadata-security-credentials

The change also removes the remaining references to the s3n: provider.
In the FE tests all URIs referring to s3n: are replaced with their
s3a: equivalents, except for a single negative test in
AnalyzeStmtsTest.java, which is removed.

In addition to the code changes, the s3n: and s3a: credential properties
are also removed from core-site.xml.tmpl. The s3a: provider can pick up
AWS S3 credentials from environment variables or IAM properties bound
to the VM instance, which is a more flexible approach.

As environment variables have precedence over IAM roles, care must be
taken when managing the canonical environment variables carrying
AWS credentials. There are two requirements to be reconciled:
1. The FE tests have code that examines s3a: URIs; this code needs
   existing, but not necessarily valid AWS credentials.
2. When the Impala test suite is executed on an EC2 VM, AWS credentials
   can be supplied via IAM roles. These credentials can be used only
   if the AWS_* environment variables are unset (do not exist).

The tradeoff is managed following these rules:
1. When AWS_* environment variables are set before invoking the
   Impala configuration scripts, their value is preserved and
   the config scripts ensure that the variables are exported.
2. If the AWS_* variables are missing or empty, they will be unset
   to ensure that credentials supplied by Amazon's IAM roles can be
   accessed,
3. except if the scripts are running outside of EC2 (so there can be
   no IAM roles) and TARGET_FILESYSTEM is not set "s3". This combination
   is most often the case on a developer's local workstation.
   In this case the AWS_* credential variables are forcibly set to
   dummy values to allow the FE tests to succeed.
   The removal of S3 credential parameters from core-site.xml[.tmpl]
   also allows users to set up their own credentials there,
   the config scripts will not change those settings.

Environment variables carrying AWS security credentials will be set
up according to the following table:

    Instance:     Running outside EC2 ||  Running in EC2 |
--------------------+--------+--------++--------+--------+
  TARGET_FILESYSTEM |   S3   | not S3 ||   S3   | not S3 |
--------------------+--------+--------++--------+--------+
                    |        |        ||        |        |
              empty | unset  | dummy  ||  unset |  unset |
AWS_*               |        |        ||        |        |
env   --------------+--------+--------++--------+--------+
var                 |        |        ||        |        |
          not empty | export | export || export | export |
                    |        |        ||        |        |
--------------------+--------+--------++--------+--------+

Legend: unset:  the variable is unset
        export: the variable is exported with its current value
        dummy:  the variable is set to a preset dummy value and
                exported

Running on an EC2 VM is indicated by setting RUNNING_IN_EC2 to "true" and
exporting it before impala_config.sh is invoked.

The change also moves the logic performing the S3 access checks into a separate
script file: bin/check-s3-access.sh. This file now contains all the S3-specific
logic and network access to check if the requested S3 bucket can be accessed.

Testing:

Performed local builds for HDFS as well as automated builds against
HDFS and S3, using both IAM roles and explicit AWS_* credentials for
authentication.
Verified that FE tests that parse s3a: URLs are still successful in
all these combinations (when they are run).

Change-Id: I14cd9d4453a91baad3c379aa7e4944993fca95ae
Reviewed-on: http://gerrit.cloudera.org:8080/8294
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Reviewed-by: Zach Amsden <zamsden@cloudera.com>
Tested-by: Impala Public Jenkins


Project: http://git-wip-us.apache.org/repos/asf/impala/repo
Commit: http://git-wip-us.apache.org/repos/asf/impala/commit/e81b7c6b
Tree: http://git-wip-us.apache.org/repos/asf/impala/tree/e81b7c6b
Diff: http://git-wip-us.apache.org/repos/asf/impala/diff/e81b7c6b

Branch: refs/heads/master
Commit: e81b7c6b682fa4bf06db0fbd30818fe698890b00
Parents: 2e83ba5
Author: Laszlo Gaal <laszlo.gaal@cloudera.com>
Authored: Fri Oct 13 00:36:57 2017 +0200
Committer: Impala Public Jenkins <impala-public-jenkins@gerrit.cloudera.org>
Committed: Sat Dec 9 01:43:01 2017 +0000

----------------------------------------------------------------------
 bin/check-s3-access.sh                          | 115 +++++++++++++++++++
 bin/impala-config.sh                            |  73 +++++++++---
 .../apache/impala/analysis/AnalyzeDDLTest.java  |  12 +-
 .../impala/analysis/AnalyzeStmtsTest.java       |   4 -
 .../common/etc/hadoop/conf/core-site.xml.tmpl   |  24 +---
 5 files changed, 176 insertions(+), 52 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/impala/blob/e81b7c6b/bin/check-s3-access.sh
----------------------------------------------------------------------
diff --git a/bin/check-s3-access.sh b/bin/check-s3-access.sh
new file mode 100755
index 0000000..68df910
--- /dev/null
+++ b/bin/check-s3-access.sh
@@ -0,0 +1,115 @@
+#!/bin/bash
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+# This file is invoked if TARGET_FILESYSTEM is set to "s3" to check
+# preconditions for accessing the specified S3 bucket. It inherits the
+# environment variables from its caller and uses them as
+# implicit parameters.
+#
+# The following environment variables are used:
+#   TARGET_FILESYSTEM
+#   S3_BUCKET
+#   AWS_ACCESS_KEY_ID
+#   AWS_SECRET_ACCESS_KEY
+#
+# Returns:
+#   0 (success): if preconditions for S3 access are satisfied.
+#   1 (failure): if S3 access is unsuccessful.
+#   2 (error): if the 'aws' executable is not on the path, or other
+#                environmental problems cause the script to fail.
+#
+# If tests are to be run against S3 as the backing file system, verify that
+# the assigned S3 bucket can be accessed.
+# Access can be authorized by AWS_ credentials passed in environment variables
+# or an EC2 IAM role assigned to the VM running the tests.
+#
+# If S3 access is granted via an IAM role assigned to the VM instance,
+# then the credentials bound to the IAM role are retrieved automatically
+# both by the Hadoop s3a: provider and by AWSCLI.
+# In this case AWS keys must not be present in environment variables or in
+# core-site.xml because their presence would preempt the IAM-based
+# credentials.
+#
+# For further details on IAM roles refer to the Amazon docs at
+# http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html#instance-metadata-security-credentials
+#
+# The assigned IAM role and the security credentials provided
+# by the role can be queried through the AWS instance metadata mechanism.
+# Instance metadata is served through an HTTP connection to the special
+# address 169.254.169.254. Details are described at:
+# http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-metadata.html#instancedata-data-retrieval
+AWS_METADATA_IP_ADDRESS="169.254.169.254"
+
+# safeguard against bad calls: if no S3 access is requested, just succeed
+# silently.
+if [ "${TARGET_FILESYSTEM-}" != "s3" ]; then
+  exit 0
+fi
+
+echo "Checking S3 access"
+
+# Check if the S3 bucket name is NULL.
+if [[ -z ${S3_BUCKET-} ]]; then
+  echo "Error: S3_BUCKET cannot be an empty string"
+  exit 1
+fi
+# S3 access can be granted using access keys passed in via the environment
+# or specifying an IAM role that has S3 access privileges.
+# First check the environment variables, they have precedence over the IAM
+# role: invalid credentials in the environment variables will prevent S3 access
+# even if a valid IAM role is present.
+# Use a subshell to prevent leaking AWS secrets.
+if (set +x; [[ -z ${AWS_ACCESS_KEY_ID-} && -z ${AWS_SECRET_ACCESS_KEY-} ]]); then
+  # If the environment variables are missing check the assumed IAM role.
+  # The IAM role can be queried via http://169.254.169.254/ using instance
+  # properties.
+  # Wget will fail if the address is not present (i.e. the script is not running on
+  # an EC2 VM) or the IAM role cannot be retrieved.
+  # Set short timeouts so the script is not blocked if run outside of EC2.
+
+  WGET_ARGS=(-T 1)
+  WGET_ARGS+=(-t 1)
+  WGET_ARGS+=(-q)
+  WGET_ARGS+=(-o /dev/null)
+  WGET_ARGS+=(http://${AWS_METADATA_IP_ADDRESS}/latest/meta-data/iam/security-credentials/)
+
+  if ! wget "${WGET_ARGS[@]}" ; then
+    echo \
+"Error: missing valid S3 credentials.
+You wanted to access an S3 bucket but you did not supply valid credentials.
+The AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables
+have to be assigned valid credentials that belong to the owner of the
+specified S3 bucket, or an IAM role authorized to access the S3 bucket
+has to be assigned to the VM instance if this is run inside an EC2 VM."
+    exit 1
+  fi
+fi
+
+if [ ! -x "$(command -v aws)" ] ; then
+  echo "Error: AWS CLI not found, unable to check S3 access."
+  exit 2
+fi
+
+aws s3 ls "s3://${S3_BUCKET}/" 1>/dev/null
+if [ $? != 0 ]; then
+  echo "Error: accessing S3_BUCKET '${S3_BUCKET}' failed."
+  exit 1
+else
+  exit 0
+fi
+

http://git-wip-us.apache.org/repos/asf/impala/blob/e81b7c6b/bin/impala-config.sh
----------------------------------------------------------------------
diff --git a/bin/impala-config.sh b/bin/impala-config.sh
index 2baac96..958e62e 100755
--- a/bin/impala-config.sh
+++ b/bin/impala-config.sh
@@ -285,8 +285,6 @@ export IMPALA_LZO="${IMPALA_LZO-$IMPALA_HOME/../Impala-lzo}"
 export IMPALA_AUX_TEST_HOME="${IMPALA_AUX_TEST_HOME-$IMPALA_HOME/../Impala-auxiliary-tests}"
 export TARGET_FILESYSTEM="${TARGET_FILESYSTEM-hdfs}"
 export FILESYSTEM_PREFIX="${FILESYSTEM_PREFIX-}"
-export AWS_SECRET_ACCESS_KEY="${AWS_SECRET_ACCESS_KEY-DummySecretAccessKey}"
-export AWS_ACCESS_KEY_ID="${AWS_ACCESS_KEY_ID-DummyAccessKeyId}"
 export S3_BUCKET="${S3_BUCKET-}"
 export azure_tenant_id="${azure_tenant_id-DummyAdlsTenantId}"
 export azure_client_id="${azure_client_id-DummyAdlsClientId}"
@@ -299,27 +297,64 @@ export WAREHOUSE_LOCATION_PREFIX="${WAREHOUSE_LOCATION_PREFIX-}"
 export LOCAL_FS="file:${WAREHOUSE_LOCATION_PREFIX}"
 export METASTORE_DB="hive_impala"
 
-if [ "${TARGET_FILESYSTEM}" = "s3" ]; then
-  # Basic error checking
-  if [[ "${AWS_ACCESS_KEY_ID}" = "DummyAccessKeyId" ||\
-        "${AWS_SECRET_ACCESS_KEY}" = "DummySecretAccessKey" ]]; then
-    echo "Both AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY
-      need to be assigned valid values and belong to the owner of the s3
-      bucket in order to access the file system"
-    return 1
+# Environment variables carrying AWS security credentials are prepared
+# according to the following rules:
+#
+#     Instance:     Running outside EC2 ||  Running in EC2 |
+# --------------------+--------+--------++--------+--------+
+#   TARGET_FILESYSTEM |   S3   | not S3 ||   S3   | not S3 |
+# --------------------+--------+--------++--------+--------+
+#                     |        |        ||        |        |
+#               empty | unset  | dummy  ||  unset |  unset |
+# AWS_*               |        |        ||        |        |
+# env   --------------+--------+--------++--------+--------+
+# var                 |        |        ||        |        |
+#           not empty | export | export || export | export |
+#                     |        |        ||        |        |
+# --------------------+--------+--------++--------+--------+
+#
+# Legend: unset:  the variable is unset
+#         export: the variable is exported with its current value
+#         dummy:  the variable is set to a constant dummy value and exported
+#
+# Running on an EC2 VM is indicated by setting RUNNING_IN_EC2 to "true" and
+# exporting it from an script running before this one.
+
+# Checks are performed in a subshell to avoid leaking secrets to log files.
+if (set +x; [[ -n ${AWS_ACCESS_KEY_ID-} ]]); then
+  export AWS_ACCESS_KEY_ID
+else
+  if [[ "${TARGET_FILESYSTEM}" == "s3" || "${RUNNING_IN_EC2:-false}" == "true" ]]; then
+    unset AWS_ACCESS_KEY_ID
+  else
+    export AWS_ACCESS_KEY_ID=DummyAccessKeyId
   fi
-  # Check if the s3 bucket is NULL.
-  if [[ "${S3_BUCKET}" = "" ]]; then
-    echo "S3_BUCKET cannot be an empty string for s3"
-    return 1
+fi
+
+if (set +x; [[ -n ${AWS_SECRET_ACCESS_KEY-} ]]); then
+  export AWS_SECRET_ACCESS_KEY
+else
+  if [[ "${TARGET_FILESYSTEM}" == "s3" || "${RUNNING_IN_EC2:-false}" == "true" ]]; then
+    unset AWS_SECRET_ACCESS_KEY
+  else
+    export AWS_SECRET_ACCESS_KEY=DummySecretAccessKey
   fi
-  aws s3 ls "s3://${S3_BUCKET}/" 1>/dev/null
-  if [ $? != 0 ]; then
-    echo "Access to ${S3_BUCKET} failed."
+fi
+
+# AWS_SESSION_TOKEN is not set to a dummy value, it is not needed by the FE tests
+if (set +x; [[ -n ${AWS_SESSION_TOKEN-} ]]); then
+  export AWS_SESSION_TOKEN
+else
+  unset AWS_SESSION_TOKEN
+fi
+
+if [ "${TARGET_FILESYSTEM}" = "s3" ]; then
+  if ${IMPALA_HOME}/bin/check-s3-access.sh; then
+    export DEFAULT_FS="s3a://${S3_BUCKET}"
+    export FILESYSTEM_PREFIX="${DEFAULT_FS}"
+  else
     return 1
   fi
-  DEFAULT_FS="s3a://${S3_BUCKET}"
-  export DEFAULT_FS
 elif [ "${TARGET_FILESYSTEM}" = "adls" ]; then
   # Basic error checking
   if [[ "${azure_client_id}" = "DummyAdlsClientId" ||\

http://git-wip-us.apache.org/repos/asf/impala/blob/e81b7c6b/fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
----------------------------------------------------------------------
diff --git a/fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java b/fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
index 6f6942a..09f5a73 100644
--- a/fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
+++ b/fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
@@ -246,7 +246,7 @@ public class AnalyzeDDLTest extends FrontendTestBase {
         "'hdfs://localhost:20500/test-warehouse/alltypes/year=2010/month=10'");
     AnalyzesOk("alter table functional.alltypes add " +
         "partition(year=2050, month=10) location " +
-        "'s3n://bucket/test-warehouse/alltypes/year=2010/month=10'");
+        "'s3a://bucket/test-warehouse/alltypes/year=2010/month=10'");
     AnalyzesOk("alter table functional.alltypes add " +
         "partition(year=2050, month=10) location " +
         "'file:///test-warehouse/alltypes/year=2010/month=10'");
@@ -628,7 +628,7 @@ public class AnalyzeDDLTest extends FrontendTestBase {
     AnalyzesOk("alter table functional.alltypes set location " +
         "'hdfs://localhost:20500/test-warehouse/a/b'");
     AnalyzesOk("alter table functional.alltypes set location " +
-        "'s3n://bucket/test-warehouse/a/b'");
+        "'s3a://bucket/test-warehouse/a/b'");
     AnalyzesOk("alter table functional.alltypes set location " +
         "'file:///test-warehouse/a/b'");
 
@@ -1391,7 +1391,7 @@ public class AnalyzeDDLTest extends FrontendTestBase {
         "CLASS 'com.bar.Foo' API_VERSION 'V1'");
     AnalyzesOk("CREATE DATA SOURCE foo LOCATION 'hdfs://localhost:20500/a/b/foo.jar' " +
         "CLASS 'com.bar.Foo' API_VERSION 'V1'");
-    AnalyzesOk("CREATE DATA SOURCE foo LOCATION 's3n://bucket/a/b/foo.jar' " +
+    AnalyzesOk("CREATE DATA SOURCE foo LOCATION 's3a://bucket/a/b/foo.jar' " +
         "CLASS 'com.bar.Foo' API_VERSION 'V1'");
 
     AnalysisError("CREATE DATA SOURCE foo LOCATION 'blah://localhost:20500/foo.jar' " +
@@ -1418,7 +1418,7 @@ public class AnalyzeDDLTest extends FrontendTestBase {
     AnalyzesOk("create database new_db location " +
         "'hdfs://localhost:50200/test-warehouse/new_db'");
     AnalyzesOk("create database new_db location " +
-        "'s3n://bucket/test-warehouse/new_db'");
+        "'s3a://bucket/test-warehouse/new_db'");
     // Invalid URI.
     AnalysisError("create database new_db location " +
         "'blah://bucket/test-warehouse/new_db'",
@@ -1673,7 +1673,7 @@ public class AnalyzeDDLTest extends FrontendTestBase {
     AnalyzesOk("create table tbl like functional.alltypes location " +
         "'file:/test-warehouse/new_table'");
     AnalyzesOk("create table tbl like functional.alltypes location " +
-        "'s3n://bucket/test-warehouse/new_table'");
+        "'s3a://bucket/test-warehouse/new_table'");
     // Invalid URI values.
     AnalysisError("create table tbl like functional.alltypes location " +
         "'foofs://test-warehouse/new_table'",
@@ -1895,7 +1895,7 @@ public class AnalyzeDDLTest extends FrontendTestBase {
     AnalyzesOk("create table tbl (i int) location " +
         "'file:///test-warehouse/new_table'");
     AnalyzesOk("create table tbl (i int) location " +
-        "'s3n://bucket/test-warehouse/new_table'");
+        "'s3a://bucket/test-warehouse/new_table'");
     AnalyzesOk("ALTER TABLE functional_seq_snap.alltypes SET LOCATION " +
         "'file://test-warehouse/new_table'");
 

http://git-wip-us.apache.org/repos/asf/impala/blob/e81b7c6b/fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
----------------------------------------------------------------------
diff --git a/fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java b/fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
index 87d0761..54aa098 100644
--- a/fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
+++ b/fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
@@ -2997,10 +2997,6 @@ public class AnalyzeStmtsTest extends AnalyzerTest {
           "tpch.lineitem", "file:///test-warehouse/test.out", overwrite),
           "INPATH location 'file:/test-warehouse/test.out' must point to an " +
           "HDFS, S3A or ADL filesystem.");
-      AnalysisError(String.format("load data inpath '%s' %s into table " +
-          "tpch.lineitem", "s3n://bucket/test-warehouse/test.out", overwrite),
-          "INPATH location 's3n://bucket/test-warehouse/test.out' must point to an " +
-          "HDFS, S3A or ADL filesystem.");
 
       // File type / table type mismatch.
       AnalyzesOk(String.format("load data inpath '%s' %s into table " +

http://git-wip-us.apache.org/repos/asf/impala/blob/e81b7c6b/testdata/cluster/node_templates/common/etc/hadoop/conf/core-site.xml.tmpl
----------------------------------------------------------------------
diff --git a/testdata/cluster/node_templates/common/etc/hadoop/conf/core-site.xml.tmpl b/testdata/cluster/node_templates/common/etc/hadoop/conf/core-site.xml.tmpl
index 9ff4ee7..1f53e97 100644
--- a/testdata/cluster/node_templates/common/etc/hadoop/conf/core-site.xml.tmpl
+++ b/testdata/cluster/node_templates/common/etc/hadoop/conf/core-site.xml.tmpl
@@ -56,29 +56,7 @@ DEFAULT</value>
     <value>true</value>
   </property>
 
-  <!-- Needed by frontend tests that reference s3n paths. -->
-  <property>
-    <name>fs.s3n.awsAccessKeyId</name>
-    <value>${AWS_ACCESS_KEY_ID}</value>
-  </property>
-
-  <property>
-    <name>fs.s3n.awsSecretAccessKey</name>
-    <value>${AWS_SECRET_ACCESS_KEY}</value>
-  </property>
-
-  <!-- Needed when the TARGET_FILESYSTEM is s3 -->
-  <property>
-    <name>fs.s3a.awsAccessKeyId</name>
-    <value>${AWS_ACCESS_KEY_ID}</value>
-  </property>
-
-  <property>
-    <name>fs.s3a.awsSecretAccessKey</name>
-    <value>${AWS_SECRET_ACCESS_KEY}</value>
-  </property>
-
-  <property>
+ <property>
     <name>fs.s3a.connection.maximum</name>
     <value>1500</value>
   </property>


Mime
View raw message