airflow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Austin Bennett <whatwouldausti...@gmail.com>
Subject Re: IAM Permissions for AWS / S3_to_Redshift_Operator
Date Thu, 31 Jan 2019 06:01:03 GMT
Hi, Andrew, Yes this is sensible (and close to what I had in mind).  I had
imagined extending the copy_query to generate different text whether
relying upon a role or secret_/access_key (to additionally keep
current/base functionality).

This looks like it can work for my needs -- and hopefully would help others
with the issue; though I'd be curious about whether getting this baked into
the operator as an option (naturally required to keep the current
functionality, and without changing how to otherwise work with it) would be
desired/helpful more generally.  It is also still worth checking out
instance profiles -- I'll report back here to what I find with that.  And,
if not suitable, potentially take a crack at extending/parameterizing how
the copy command is generated.

Anyways, thanks!


On Wed, Jan 30, 2019 at 5:22 PM Andrew Harmon <andrewhharmon@gmail.com>
wrote:

> Here is the code we came up with. Pulled some stuff out for some weird
> stuff that only pertained to us. This is the execute() of the operator we
> wrote. Basically, it uses the PostgresHook to connect to Redshift. Then i
> stored the Role arn in the extras of the Airflow Postgres connection in a
> key called RedshiftRole. I grab that role from the connection and insert it
> into the copy command text.
>
> def execute(self, context):
>         self.redshift =
> PostgresHook(postgres_conn_id=self.redshift_conn_id)
>
>         role =
>
> self.redshift.get_connection('RedshiftConnection').extra_dejson['RedshiftRole']
>
>         copy_options = '\n\t\t\t'.join(self.copy_options)
>         copy_query = """
>             COPY {table}
>             FROM 's3://{s3_bucket}/{s3_key}'
>             with credentials
>             'aws_iam_role={role}'
>             {copy_options};
>         """.format(table=self.rs_table_name,
>                    s3_bucket=self.s3_bucket,
>                    s3_key=self.s3_prefix,
>                    role=role,
>
>  copy_options=copy_options.format(jsonpath=self.json_path))
>
>
>         self.log.info('Executing COPY command...')
>         self.redshift.run(copy_query, self.autocommit)
>         self.log.info("COPY command complete...")
>
> On Wed, Jan 30, 2019 at 8:04 PM Andrew Harmon <andrewhharmon@gmail.com>
> wrote:
>
> > I think what @ash is referring to is if you have an IAM role associated
> > with an EC2 instance, and your AWS connection in Airflow is left blank,
> > Boto3 will default tonthst role for any calls made by Boto3. However, in
> > this instance, Boto3 is not used, psycopg2 is used to make a connection
> to
> > Redshift, and you would need some way to pass in the role arn to the copy
> > text.
> >
> > On Wed, Jan 30, 2019 at 7:51 PM Austin Bennett <
> > whatwouldaustindo@gmail.com> wrote:
> >
> >> @Andrew, indeed, having to authenticate to Redshift, separate from
> >> credentials that allow S3 access, is how I work (outside of Airflow), so
> >> also sensible that would be how is done in Airflow.  I guess I should
> use
> >> ARN - rather than IAM - as the acronym (referring to the redshift-copy
> >> role/credentials).
> >>
> >> Certainly happy to go through your code!  Follow up (does the operator
> >> need
> >> to be extended for all?) becomes an open question of whether this is a
> >> need
> >> that isn't addressed with what is currently available (though I also am
> >> going to checkout what @ash mentions around instance profiles).
> >>
> >> @david our group standardized around practices of not using
> >> SECRET_ACCESS_KEY (at least as refers to tied to AWS "Users" -- I'm also
> >> not an expert with aws security, so might be messing up the
> terminology),
> >> which is why the aim is to not have to go against that.
> >>
> >>
> >>
> >> On Wed, Jan 30, 2019 at 4:08 PM Andrew Harmon <andrewhharmon@gmail.com>
> >> wrote:
> >>
> >> > Maybe just to clarify, to connect to Redshift and issue a COPY, you’ll
> >> need
> >> > a Redshift username and password. You would store that in a Postgres
> >> > connection. This is a un/pw in Redshift, not AWS creds. The SQL text
> >> needed
> >> > to issue the COPY requires either AWS creds or the arn of a role to
> use.
> >> >
> >> > Andrew
> >> >
> >> > On Wed, Jan 30, 2019 at 6:52 PM Andrew Harmon <
> andrewhharmon@gmail.com>
> >> > wrote:
> >> >
> >> > > Hi, I extended the Redshift operator to pull the Role needed for the
> >> copy
> >> > > from the connection object. I stored the role arm in the extras of
> the
> >> > > connection. Works well so far. I’m not sure if that helps or if
> you’re
> >> > > looking for an out of the box solution, but I’d be happy to share
my
> >> code
> >> > > with you.
> >> > >
> >> > > Andrew
> >> > >
> >> > > On Wed, Jan 30, 2019 at 5:57 PM Austin Bennett <
> >> > > whatwouldaustindo@gmail.com> wrote:
> >> > >
> >> > >> @ash I'll look into that as an option.  Given I am still a novice
> >> user,
> >> > >> I'm
> >> > >> consistently impressed with the simplicity (once understood) given
> >> the
> >> > >> layers of abstractions.  I am not familiar enough with Instance
> >> profiles
> >> > >> to
> >> > >> say whether that is suitable.
> >> > >>
> >> > >> Was reading the copy_query default (
> >> > >>
> >> > >>
> >> >
> >>
> https://github.com/apache/airflow/blob/master/airflow/operators/s3_to_redshift_operator.py#L93
> >> > >> ):
> >> > >> copy_query = """
> >> > >> COPY {schema}.{table}
> >> > >> FROM 's3://{s3_bucket}/{s3_key}/{table}'
> >> > >> with credentials
> >> > >> 'aws_access_key_id={access_key};aws_secret_access_key={secret_key}'
> >> > >> {copy_options};
> >> > >> """.format(schema=self.schema,
> >> > >> table=self.table,
> >> > >> s3_bucket=self.s3_bucket,
> >> > >> s3_key=self.s3_key,
> >> > >> access_key=credentials.access_key,
> >> > >> secret_key=credentials.secret_key,
> >> > >> copy_options=copy_options)
> >> > >>
> >> > >> Which then seems like there could be issues if not providing keys.
> >> And
> >> > >> ultimately I'd like to issue an alternate (albeit similar)
> >> copy_query:
> >> > >>
> >> > >> Currently, for perhaps 100 tables, am issuing the command "COPY
> >> > >> <TABLENAME>
> >> > >> FROM 's3://<LOCATION>' iam_role
> >> > >> 'arn:aws:iam::<####>:role/<redshift-copy-unload>'
<OPTIONS
> DELIMITER
> >> > GZIP,
> >> > >> etc etc>;".
> >> > >>
> >> > >> Or, from their docs (linked in this thread):
> >> > >>
> >> > >> copy catdemo
> >> > >> from 's3://awssampledbuswest2/tickit/category_pipe.txt'
> >> > >> iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>'
> >> > >> region 'us-west-2';
> >> > >>
> >> > >>
> >> > >> Was anchoring on getting the exact same statement built, which
> seems
> >> > >> straight forward (I could sketch that out, if you think that'd
be
> of
> >> > >> interest).  Perhaps that goal makes sense (whether that is the
best
> >> > >> solution is certainly a different story).
> >> > >>
> >> > >> I must acknowledge inexperience with Airflow, so believe there
is
> >> more
> >> > to
> >> > >> it than my simplistic approach.  Redshift with Airflow is queued
> for
> >> > >> another couple weeks (haven't even touched Redshift in months),
was
> >> > trying
> >> > >> to get ahead of things.  I can start with an empty connection
type
> >> and
> >> > >> reading up on instance profiles.
> >> > >>
> >> > >> Thank you for the input/feedback!
> >> > >>
> >> > >>
> >> > >>
> >> > >>
> >> > >> On Wed, Jan 30, 2019 at 1:56 PM Ash Berlin-Taylor <ash@apache.org>
> >> > wrote:
> >> > >>
> >> > >> > If you create an "empty" connection of type "AWS" (i.e. don't
> >> specify
> >> > a
> >> > >> > username or password) then the AWSHook/S3Hook will use instance
> >> > >> profiles.
> >> > >> >
> >> > >> > Is that what you want?
> >> > >> >
> >> > >> > -ash
> >> > >> >
> >> > >> > > On 30 Jan 2019, at 18:45, Austin Bennett <
> >> > whatwouldaustindo@gmail.com
> >> > >> >
> >> > >> > wrote:
> >> > >> > >
> >> > >> > > Have started to push our group to standardizing on airflow.
 We
> >> > still
> >> > >> > have
> >> > >> > > a few large Redshift clusters.
> >> > >> > >
> >> > >> > > The s3_to_redshift_operator.py only appears to be written
to
> >> > >> authenticate
> >> > >> > > via secret/access-keys.  We no longer use Key Based
> >> authentication
> >> > and
> >> > >> > rely
> >> > >> > > upon Role Based, therefore IAM groups.
> >> > >> > >
> >> > >> > > What would the recommended way - in airflow - to rely
upon IAM
> >> for
> >> > >> > > authentication for COPY Commands to Redshift?  Should
I rely
> on a
> >> > >> plugin?
> >> > >> > > I'd be interested to contribute to the project by
> >> updating/extending
> >> > >> > > s3_to_redshift_operator.py should that be welcome.
> >> > >> > >
> >> > >> > >
> >> > >> > > References:
> >> > >> > > * https://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html
> >> > >> > > *
> >> > >> > >
> >> > >> >
> >> > >>
> >> >
> >>
> https://airflow.readthedocs.io/en/latest/_modules/airflow/operators/s3_to_redshift_operator.html
> >> > >> > >
> >> > >> > > *
> >> > >> > >
> >> > >> >
> >> > >>
> >> >
> >>
> https://github.com/apache/airflow/blob/master/airflow/operators/s3_to_redshift_operator.py
> >> > >> >
> >> > >> >
> >> > >>
> >> > > --
> >> > > Sent from Gmail Mobile
> >> > >
> >> >
> >>
> > --
> > Sent from Gmail Mobile
> >
>
>
> --
> Andrew Harmon
> (202) 615-6433
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message