airflow-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Michael Ghen (JIRA)" <j...@apache.org>
Subject [jira] [Created] (AIRFLOW-1632) MySQL to GCS to BigQuery fails for dates before 1970
Date Thu, 21 Sep 2017 20:20:00 GMT
Michael Ghen created AIRFLOW-1632:
-------------------------------------

             Summary: MySQL to GCS to BigQuery fails for dates before 1970
                 Key: AIRFLOW-1632
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-1632
             Project: Apache Airflow
          Issue Type: Bug
          Components: gcp
         Environment: Google Cloud Platform
            Reporter: Michael Ghen
            Assignee: Michael Ghen
            Priority: Minor
             Fix For: 1.9.0


For tables in MySQL that use a "date" type, a dag that exports from MySQL to Google Cloud
Storage and then loads from GCS to BigQuery will fail.

When the table is exported as JSON to a GCS bucket, date fields are parsed as datetimes are
parse, converted to timestamps. However, the schema file sets the type as "STRING."

This creates a problem when you try to load the JSON into BigQuery using the schema file.

*Steps to reproduce*
0. Set up a MySQL connection and GCP connection in Airflow.
1. Create a MySQL table with a "date" field and put some data into the table. 
{code}
CREATE TABLE table_with_date (
date_field date,
datetime_field datetime
);
INSERT INTO table_with_date (date_field, datetime_field) VALUES ('2017-09-09',NOW());
{code}
2. Create a DAG that will export the data from the MySQL to GCS and then load from GCS to
BigQuery (use the schema file). For example:
{code}
extract = MySqlToGoogleCloudStorageOperator(
        task_id="extract_table",
        mysql_conn_id='mysql_connection',
        google_cloud_storage_conn_id='gcp_connection',
        sql="SELECT * FROM table_with_date",
        bucket='gcs-bucket',
        filename='table_with_date.json',
        schema_filename='schemas/table_with_date.json',
        dag=dag)

load = GoogleCloudStorageToBigQueryOperator(
        task_id="load_table",
        bigquery_conn_id='gcp_connection',
        google_cloud_storage_conn_id='gcp_connection',
        bucket='gcs-bucket',
        destination_project_dataset_table="dataset.table_with_date",
        source_objects=['table_with_date.json'],
        schema_object='schemas/table_with_date.json',
        source_format='NEWLINE_DELIMITED_JSON',
        create_disposition='CREATE_IF_NEEDED',
        write_disposition='WRITE_TRUNCATE',
        dag=dag)

load.set_upstream(extract)
{code}
3. Run the DAG 

Expected: The DAG runs successfully.
Actual: The `load_table` task fails with error:
{code}
...
{u'reason': u'invalid', 
u'message': u'JSON parsing error in row starting at position 0: Could not convert value to
string. Field: date_field; Value: 1504929600.000000', 
u'location': u'gs://gcs-bucket/table_with_date.json'
...
{code}
*Comments:*
Seems like this was just a simple oversight in the section of `airflow/contrib/operators/mysql_to_gcs.py`
where the types get converted. In `convert_types` both `date` and `datetime` types get converted
to timestamps but in `type_map` there is no mapping for `FIELD_TYPE.date`. This small bug
almost turned my team off from using airflow because we had a lot of tables that didn't flow
into BigQuery because we used the `date` type a lot.




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Mime
View raw message