airflow-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "ASF subversion and git services (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AIRFLOW-1586) MySQL to GCS to BigQuery fails for tables with date types
Date Mon, 11 Sep 2017 11:12:00 GMT

    [ https://issues.apache.org/jira/browse/AIRFLOW-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16161082#comment-16161082
] 

ASF subversion and git services commented on AIRFLOW-1586:
----------------------------------------------------------

Commit e83012589bc6ac97421d462e32ab80ae05bb9817 in incubator-airflow's branch refs/heads/master
from Mike Ghen
[ https://git-wip-us.apache.org/repos/asf?p=incubator-airflow.git;h=e830125 ]

[AIRFLOW-1586] Add mapping for date type to mysql_to_gcs operator

Closes #2589 from mikeghen/bug/airflow-1586


> MySQL to GCS to BigQuery fails for tables with date types
> ---------------------------------------------------------
>
>                 Key: AIRFLOW-1586
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-1586
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: gcp
>         Environment: Google Cloud Platform
>            Reporter: Michael Ghen
>            Assignee: Michael Ghen
>            Priority: Minor
>             Fix For: 1.9.0
>
>
> For tables in MySQL that use a "date" type, a dag that exports from MySQL to Google Cloud
Storage and then loads from GCS to BigQuery will fail.
> When the table is exported as JSON to a GCS bucket, date fields are parsed as datetimes
are parse, converted to timestamps. However, the schema file sets the type as "STRING."
> This creates a problem when you try to load the JSON into BigQuery using the schema file.
> *Steps to reproduce*
> 0. Set up a MySQL connection and GCP connection in Airflow.
> 1. Create a MySQL table with a "date" field and put some data into the table. 
> {code}
> CREATE TABLE table_with_date (
> date_field date,
> datetime_field datetime
> );
> INSERT INTO table_with_date (date_field, datetime_field) VALUES ('2017-09-09',NOW());
> {code}
> 2. Create a DAG that will export the data from the MySQL to GCS and then load from GCS
to BigQuery (use the schema file). For example:
> {code}
> extract = MySqlToGoogleCloudStorageOperator(
>         task_id="extract_table",
>         mysql_conn_id='mysql_connection',
>         google_cloud_storage_conn_id='gcp_connection',
>         sql="SELECT * FROM table_with_date",
>         bucket='gcs-bucket',
>         filename='table_with_date.json',
>         schema_filename='schemas/table_with_date.json',
>         dag=dag)
> load = GoogleCloudStorageToBigQueryOperator(
>         task_id="load_table",
>         bigquery_conn_id='gcp_connection',
>         google_cloud_storage_conn_id='gcp_connection',
>         bucket='gcs-bucket',
>         destination_project_dataset_table="dataset.table_with_date",
>         source_objects=['table_with_date.json'],
>         schema_object='schemas/table_with_date.json',
>         source_format='NEWLINE_DELIMITED_JSON',
>         create_disposition='CREATE_IF_NEEDED',
>         write_disposition='WRITE_TRUNCATE',
>         dag=dag)
> load.set_upstream(extract)
> {code}
> 3. Run the DAG 
> Expected: The DAG runs successfully.
> Actual: The `load_table` task fails with error:
> {code}
> ...
> {u'reason': u'invalid', 
> u'message': u'JSON parsing error in row starting at position 0: Could not convert value
to string. Field: date_field; Value: 1504929600.000000', 
> u'location': u'gs://gcs-bucket/table_with_date.json'
> ...
> {code}
> *Comments:*
> Seems like this was just a simple oversight in the section of `airflow/contrib/operators/mysql_to_gcs.py`
where the types get converted. In `convert_types` both `date` and `datetime` types get converted
to timestamps but in `type_map` there is no mapping for `FIELD_TYPE.date`. This small bug
almost turned my team off from using airflow because we had a lot of tables that didn't flow
into BigQuery because we used the `date` type a lot.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Mime
View raw message