airflow-commits mailing list archives

From "Siddharth (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (AIRFLOW-1560) Add AWS DynamoDB hook for inserting batch items
Date Mon, 04 Sep 2017 22:10:00 GMT

     [ https://issues.apache.org/jira/browse/AIRFLOW-1560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth updated AIRFLOW-1560:
-------------------------------
    Description: 
This PR addresses Airflow integration with AWS DynamoDB.

Currently there is no hook to interact with DynamoDB for reading or writing items (single or batch insertions). To get started, we want to push data into DynamoDB using Airflow jobs scheduled daily: the idea is to read aggregates from S3 and push them into DynamoDB, with a write job running every day to make this happen. The first step is to create a DynamoDB hook (which this PR adds), and the second is to create an operator that moves data from S3 to DynamoDB (a Hive-to-DynamoDB transfer operator has been added).
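As a rough illustration, a batch-insert hook along these lines could look like the sketch below. The class name AwsDynamoDBHook, the write_batch_data method, and the get_resource_type helper on the parent hook are illustrative assumptions for this sketch, not necessarily the exact API of the PR:

    from airflow.contrib.hooks.aws_hook import AwsHook

    class AwsDynamoDBHook(AwsHook):
        """Illustrative hook that batch-writes items to a DynamoDB table."""

        def __init__(self, table_name=None, region_name=None,
                     aws_conn_id='aws_default'):
            super(AwsDynamoDBHook, self).__init__(aws_conn_id=aws_conn_id)
            self.table_name = table_name
            self.region_name = region_name

        def get_conn(self):
            # Assumes the parent hook can hand back a boto3 *resource*
            # (hypothetical get_resource_type helper), mirroring how it
            # already hands back low-level boto3 clients.
            self.conn = self.get_resource_type('dynamodb',
                                               region_name=self.region_name)
            return self.conn

        def write_batch_data(self, items):
            """Insert a list of dict items into the table in batches."""
            table = self.get_conn().Table(self.table_name)
            # batch_writer() buffers puts into BatchWriteItem calls and
            # retries unprocessed items automatically.
            with table.batch_writer() as batch:
                for item in items:
                    batch.put_item(Item=item)
            return True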

I noticed that Airflow already has AWS_HOOK, a parent hook for connecting to AWS using credentials stored in configs. It has a function to connect to AWS services through the low-level client API (http://boto3.readthedocs.io/en/latest/reference/services/dynamodb.html#client), which is what EMR_HOOK uses. For inserting data, however, we can use the DynamoDB resource API (http://boto3.readthedocs.io/en/latest/reference/services/dynamodb.html#service-resource), which provides higher-level abstractions for writing data to DynamoDB. A good question to ask is: what is the difference between a client and a resource, and why use one over the other? "Resources are higher-level abstraction than the raw, low-level calls made by service clients. They can't do anything the clients can't do, but in many cases they are nicer to use. The downside is that they don't always support 100% of the features of a service." (http://boto3.readthedocs.io/en/latest/guide/resources.html)
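To make the difference concrete, here is a minimal comparison of inserting the same items through each API; the table name, region, and items are made up for illustration. The low-level client speaks DynamoDB's wire format with explicit type descriptors, while the resource accepts plain Python values:

    import boto3

    items = [{'id': '1', 'name': 'alpha'}, {'id': '2', 'name': 'beta'}]

    # Client API: low level, mirrors the wire protocol. Every attribute
    # carries an explicit type descriptor ('S' = string), and retrying
    # any UnprocessedItems in the response is left to the caller.
    client = boto3.client('dynamodb', region_name='us-east-1')
    client.batch_write_item(RequestItems={
        'example_table': [
            {'PutRequest': {'Item': {'id': {'S': i['id']},
                                     'name': {'S': i['name']}}}}
            for i in items
        ]
    })

    # Resource API: higher level. Plain Python values are serialized
    # for you, and batch_writer() handles batching and retrying of
    # unprocessed items behind the scenes.
    table = boto3.resource('dynamodb', region_name='us-east-1').Table('example_table')
    with table.batch_writer() as batch:
        for item in items:
            batch.put_item(Item=item)

Batch insertion is exactly where the resource API pays off, which is why the hook sketched above builds on it.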



  was:
This PR addresses Airflow integration with AWS DynamoDB.

Currently there is no hook to interact with DynamoDB for reading or writing items (single or batch insertions). To get started, we want to push data into DynamoDB using Airflow jobs scheduled daily: the idea is to read aggregates from S3 and push them into DynamoDB, with a write job running every day to make this happen. The first step is to create a DynamoDB hook (which this PR adds), and the second is to create an operator that moves data from S3 to DynamoDB.

I noticed that Airflow already has AWS_HOOK, a parent hook for connecting to AWS using credentials stored in configs. It has a function to connect to AWS services through the low-level client API (http://boto3.readthedocs.io/en/latest/reference/services/dynamodb.html#client), which is what EMR_HOOK uses. For inserting data, however, we can use the DynamoDB resource API (http://boto3.readthedocs.io/en/latest/reference/services/dynamodb.html#service-resource), which provides higher-level abstractions for writing data to DynamoDB. A good question to ask is: what is the difference between a client and a resource, and why use one over the other? "Resources are higher-level abstraction than the raw, low-level calls made by service clients. They can't do anything the clients can't do, but in many cases they are nicer to use. The downside is that they don't always support 100% of the features of a service." (http://boto3.readthedocs.io/en/latest/guide/resources.html)




> Add AWS DynamoDB hook for inserting batch items
> -----------------------------------------------
>
>                 Key: AIRFLOW-1560
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-1560
>             Project: Apache Airflow
>          Issue Type: New Feature
>          Components: aws, boto3, hooks
>            Reporter: Siddharth
>            Assignee: Siddharth
>
> This PR addresses Airflow integration with AWS DynamoDB.
> Currently there is no hook to interact with DynamoDB for reading or writing items (single or batch insertions). To get started, we want to push data into DynamoDB using Airflow jobs scheduled daily: the idea is to read aggregates from S3 and push them into DynamoDB, with a write job running every day to make this happen. The first step is to create a DynamoDB hook (which this PR adds), and the second is to create an operator that moves data from S3 to DynamoDB (a Hive-to-DynamoDB transfer operator has been added).
> I noticed that Airflow already has AWS_HOOK, a parent hook for connecting to AWS using credentials stored in configs. It has a function to connect to AWS services through the low-level client API (http://boto3.readthedocs.io/en/latest/reference/services/dynamodb.html#client), which is what EMR_HOOK uses. For inserting data, however, we can use the DynamoDB resource API (http://boto3.readthedocs.io/en/latest/reference/services/dynamodb.html#service-resource), which provides higher-level abstractions for writing data to DynamoDB. A good question to ask is: what is the difference between a client and a resource, and why use one over the other? "Resources are higher-level abstraction than the raw, low-level calls made by service clients. They can't do anything the clients can't do, but in many cases they are nicer to use. The downside is that they don't always support 100% of the features of a service." (http://boto3.readthedocs.io/en/latest/guide/resources.html)




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
