Date: Sat, 9 Sep 2017 19:51:02 +0000 (UTC)
From: "Michael Ghen (JIRA)"
To: commits@airflow.incubator.apache.org
Subject: [jira] [Updated] (AIRFLOW-1586) MySQL to GCS to BigQuery fails for tables with date types

[ https://issues.apache.org/jira/browse/AIRFLOW-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Ghen updated AIRFLOW-1586:
----------------------------------

    Description:

For tables in MySQL that use a "date" type, a DAG that exports from MySQL to Google Cloud Storage and then loads from GCS to BigQuery will fail.

When the table is exported as JSON to a GCS bucket, date fields are parsed as datetimes and converted to Unix timestamps. However, the schema file sets the field's type to "STRING." This creates a problem when you try to load the JSON into BigQuery using the schema file.
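To illustrate the mismatch, the two exported files could plausibly look like this (hypothetical contents; exact values depend on the row data and the worker's time zone):

{code}
# table_with_date.json -- exported row: date_field arrives as a float timestamp
{"date_field": 1504929600.0, "datetime_field": 1504985462.0}

# schemas/table_with_date.json -- schema file: date_field is declared STRING
[{"name": "date_field", "type": "STRING", "mode": "NULLABLE"},
 {"name": "datetime_field", "type": "TIMESTAMP", "mode": "NULLABLE"}]
{code}

BigQuery then refuses to coerce the numeric value into a STRING column, which is the failure reproduced below.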
*Steps to reproduce*

0. Set up a MySQL connection and a GCP connection in Airflow.

1. Create a MySQL table with a "date" field and put some data into the table.

{code}
CREATE TABLE table_with_date (
    date_field date,
    datetime_field datetime
);
INSERT INTO table_with_date (date_field, datetime_field) VALUES ('2017-09-09', NOW());
{code}

2. Create a DAG that exports the data from MySQL to GCS and then loads from GCS to BigQuery (using the schema file). For example:

{code}
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.mysql_to_gcs import MySqlToGoogleCloudStorageOperator
from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator

# Minimal DAG definition for the repro; the DAG id and start date are arbitrary.
dag = DAG('mysql_to_bq_date_repro',
          start_date=datetime(2017, 9, 9),
          schedule_interval=None)

extract = MySqlToGoogleCloudStorageOperator(
    task_id="extract_table",
    mysql_conn_id='mysql_connection',
    google_cloud_storage_conn_id='gcp_connection',
    sql="SELECT * FROM table_with_date",
    bucket='gcs-bucket',
    filename='table_with_date.json',
    schema_filename='schemas/table_with_date.json',
    dag=dag)

load = GoogleCloudStorageToBigQueryOperator(
    task_id="load_table",
    bigquery_conn_id='gcp_connection',
    google_cloud_storage_conn_id='gcp_connection',
    bucket='gcs-bucket',
    destination_project_dataset_table="dataset.table_with_date",
    source_objects=['table_with_date.json'],
    schema_object='schemas/table_with_date.json',
    source_format='NEWLINE_DELIMITED_JSON',
    create_disposition='CREATE_IF_NEEDED',
    write_disposition='WRITE_TRUNCATE',
    dag=dag)

load.set_upstream(extract)
{code}

3. Run the DAG.

Expected: The DAG runs successfully.

Actual: The `load_table` task fails with error:

{code}
...
{u'reason': u'invalid',
 u'message': u'JSON parsing error in row starting at position 0: Could not convert value to string. Field: date_field; Value: 1504929600.000000',
 u'location': u'gs://gcs-bucket/table_with_date.json'
...
{code}

(1504929600 is a Unix timestamp for 2017-09-09: the date value was exported as a float timestamp even though the schema file declares `date_field` as a STRING.)

*Comments:*

This seems like a simple oversight in the section of `airflow/contrib/operators/mysql_to_gcs.py` where the types get converted. In `convert_types`, both `date` and `datetime` values get converted to timestamps, but `type_map` has no mapping for `FIELD_TYPE.DATE`, so date columns fall back to the default STRING type in the schema file. This small bug almost turned my team off from using Airflow, because we use the `date` type a lot and many of our tables didn't flow into BigQuery.
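A minimal sketch of the implied fix, assuming `type_map` is a dict keyed by `MySQLdb` `FIELD_TYPE` constants that falls back to STRING for unknown types (the surrounding entries here are illustrative, not the operator's full dict):

{code}
from MySQLdb.constants import FIELD_TYPE

# Mapping used when writing the BigQuery schema file. convert_types()
# already serializes date and datetime values identically (as Unix
# timestamps), so DATE should map to the same BigQuery type as DATETIME.
type_map = {
    FIELD_TYPE.DATETIME: 'TIMESTAMP',
    FIELD_TYPE.TIMESTAMP: 'TIMESTAMP',
    FIELD_TYPE.DATE: 'TIMESTAMP',  # proposed addition; without it, date
                                   # columns fall through to 'STRING'
    # ... integer/float mappings elided ...
}
{code}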
> MySQL to GCS to BigQuery fails for tables with date types
> ---------------------------------------------------------
>
>                 Key: AIRFLOW-1586
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-1586
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: gcp
>        Environment: Google Cloud Platform
>            Reporter: Michael Ghen
>            Assignee: Michael Ghen
>            Priority: Minor
>

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)