airflow-commits mailing list archives

From "Ilya Kisil (Jira)" <j...@apache.org>
Subject [jira] [Commented] (AIRFLOW-5060) Add support of CatalogId to AwsGlueCatalogHook
Date Wed, 20 Nov 2019 10:56:00 GMT

    [ https://issues.apache.org/jira/browse/AIRFLOW-5060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16978309#comment-16978309
] 

Ilya Kisil commented on AIRFLOW-5060:
-------------------------------------

[~jackjack10] another thing to consider when extending the Glue hook with other available methods: unit tests.
Not sure how straightforward they would be.
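
A minimal sketch of what such a unit test could look like, assuming the hook exposes a get_conn() method that returns the boto3 Glue client. GlueCatalogHookSketch is a hypothetical stand-in written for this example, not Airflow's real class; only the get_table parameters (CatalogId, DatabaseName, Name) come from the boto3 Glue client.

```python
import unittest
from unittest import mock


class GlueCatalogHookSketch:
    """Hypothetical stand-in for the hook; not Airflow's real class."""

    def __init__(self, catalog_id):
        self.catalog_id = catalog_id

    def get_conn(self):
        # The real hook would build and return a boto3 Glue client here.
        raise NotImplementedError

    def get_table(self, database_name, table_name):
        # Forward the catalog ID unchanged to the Glue client.
        response = self.get_conn().get_table(
            CatalogId=self.catalog_id,
            DatabaseName=database_name,
            Name=table_name,
        )
        return response["Table"]


class TestGetTable(unittest.TestCase):
    def test_catalog_id_is_forwarded(self):
        hook = GlueCatalogHookSketch(catalog_id="111122223333")
        # Replace get_conn with a mock so no AWS call is made.
        with mock.patch.object(hook, "get_conn") as get_conn:
            client = get_conn.return_value
            client.get_table.return_value = {"Table": {"Name": "events"}}
            table = hook.get_table("analytics", "events")

        self.assertEqual(table, {"Name": "events"})
        client.get_table.assert_called_once_with(
            CatalogId="111122223333",
            DatabaseName="analytics",
            Name="events",
        )
```

Because the Glue client is mocked, a test like this only checks that arguments are passed through as given. It can be run with `python -m unittest` and needs no AWS credentials.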

> Add support of CatalogId to AwsGlueCatalogHook
> ----------------------------------------------
>
>                 Key: AIRFLOW-5060
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5060
>             Project: Apache Airflow
>          Issue Type: New Feature
>          Components: hooks
>    Affects Versions: 1.10.3
>            Reporter: Ilya Kisil
>            Assignee: Ilya Kisil
>            Priority: Minor
>
> h2. Use Case
> Imagine that you stream data into an S3 bucket of *account A* and update the AWS Glue Data Catalog
on a daily basis, so that you can query new data with AWS Athena. Now let's assume that you
provided access to this S3 bucket to an external *account B*, which wants to use its own
AWS Athena to query your data in exactly the same way. Unfortunately, *account B* would
need to have exactly the same table definitions in its AWS Glue Data Catalog, because AWS
Athena cannot run against an external Glue Data Catalog. However, the AWS Glue service supports [cross-account
datacatalog access|https://docs.aws.amazon.com/glue/latest/dg/cross-account-access.html],
which means that *account B* can simply copy/sync meta information about databases, tables,
partitions etc. from the Glue Data Catalog of *account A*, provided additional permissions
have been granted. Thus, all methods in *AwsGlueCatalogHook* should use "CatalogId", i.e. the ID
of the Data Catalog from which to retrieve/create/delete.
> h2. How it fits into Airflow
> Assume that you have an AWSAthenaOperator which queries data once a day; the result
is then retrieved, visualised locally and uploaded to some server/website. Before this
happens, you simply need to create an operator (even a PythonOperator would do) which has two
hooks, one for the source catalog and another for the destination catalog. At run time, it would use the source
hook to retrieve information from *account A*, for example with [get_partitions()|https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Client.get_partitions],
then parse the response and remove unnecessary keys, and finally use the destination hook to update
the *account B* datacatalog with [batch_create_partition()|https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Client.batch_create_partition].
>  
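
The copy step described above could be sketched roughly as follows. This is an illustration under assumptions, not a proposed implementation: copy_partitions, to_partition_input and READ_ONLY_KEYS are hypothetical names, the two client arguments stand for the boto3 Glue clients behind the source and destination hooks, and the set of keys to strip is based on the boto3 Glue documentation and should be checked against the boto3 version in use.

```python
from typing import Any, Dict

# Keys present in get_partitions output that PartitionInput does not accept.
# Assumption based on the boto3 Glue docs; verify for your boto3 version.
READ_ONLY_KEYS = {"DatabaseName", "TableName", "CreationTime", "CatalogId"}


def to_partition_input(partition: Dict[str, Any]) -> Dict[str, Any]:
    """Drop the keys that batch_create_partition would reject."""
    return {k: v for k, v in partition.items() if k not in READ_ONLY_KEYS}


def copy_partitions(
    source_client: Any,
    dest_client: Any,
    database: str,
    table: str,
    source_catalog_id: str,
    dest_catalog_id: str,
    batch_size: int = 100,  # batch_create_partition accepts at most 100 inputs
) -> int:
    """Copy all partitions of one table from the source to the destination catalog."""
    paginator = source_client.get_paginator("get_partitions")
    pages = paginator.paginate(
        CatalogId=source_catalog_id,
        DatabaseName=database,
        TableName=table,
    )
    inputs = [to_partition_input(p) for page in pages for p in page["Partitions"]]

    # Send the partition definitions to the destination catalog in batches.
    for start in range(0, len(inputs), batch_size):
        dest_client.batch_create_partition(
            CatalogId=dest_catalog_id,
            DatabaseName=database,
            TableName=table,
            PartitionInputList=inputs[start:start + batch_size],
        )
    return len(inputs)
```

The sketch ignores the Errors list that batch_create_partition returns (for example for partitions that already exist); a real operator would need to inspect it.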
> h2. Proposal
>  * Add a parameter *catalog_id* to AwsGlueCatalogHook, which will then be used in all
its methods, regardless of whether the hook is associated with the source or the destination datacatalog.
>  * In order not to break the existing implementation, we set *catalog_id=None*. But we
add a method *fallback_catalog_id()*, which uses AWS STS to infer the Catalog ID associated with
the *aws_conn_id* in use. The obtained value would be used if *catalog_id* hasn't been provided
during hook creation.
>  * Extend the available methods of *AwsGlueCatalogHook* in a similar way to the already existing
ones, for convenience of the workflow described above. Note: all new methods should strictly
adhere to the AWS Glue Client request syntax and do so in a transparent manner. This means that input
information shouldn't be modified within a method. When such modifications are required, they should
be performed outside of the AwsGlueCatalogHook.
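
The fallback in the second bullet could look roughly like this. It is a sketch under assumptions: CatalogIdResolver is a hypothetical helper written only to show the lookup order, and sts_client stands for a boto3 STS client created from the hook's connection. It relies on the default Glue Data Catalog ID being the AWS account ID, which get_caller_identity() returns under the "Account" key.

```python
from typing import Any, Optional


def fallback_catalog_id(sts_client: Any) -> str:
    """Return the caller's account ID, which Glue uses as the default catalog ID."""
    return sts_client.get_caller_identity()["Account"]


class CatalogIdResolver:
    """Hypothetical helper: prefer an explicit catalog_id, else ask STS once."""

    def __init__(self, sts_client: Any, catalog_id: Optional[str] = None) -> None:
        self._sts_client = sts_client
        self._catalog_id = catalog_id

    @property
    def catalog_id(self) -> str:
        # Resolve lazily and cache, so STS is called at most once per instance.
        if self._catalog_id is None:
            self._catalog_id = fallback_catalog_id(self._sts_client)
        return self._catalog_id
```

Resolving lazily keeps hook creation free of network calls, and existing code that never passes *catalog_id* keeps working against its own account's catalog.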
> h2. Implementation
>  * I am happy to contribute to airflow if this feature request gets approved.
> h2. Other considerations
>  * At the moment, the existing method *get_partitions* doesn't provide you with all the
meta information about partitions that is available from the Glue client, whereas *get_table* does.
I don't know the best way around it, but imho it should be refactored to *get_partitions_values* or
something like that. In this way, we would be able to stay in line with the boto3 Glue client.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
