From "Michael Ghen (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (AIRFLOW-1632) MySQL to GCS fails for date/datetime before ~1850
Date Thu, 21 Sep 2017 20:40:00 GMT

     [ https://issues.apache.org/jira/browse/AIRFLOW-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Ghen updated AIRFLOW-1632:
----------------------------------
    Description: 
For tables in MySQL that use a "date" or "datetime" type, a DAG that exports from MySQL to Google Cloud Storage and then loads from GCS into BigQuery will fail when the dates are far enough before 1970 (in practice, before roughly 1850; the exact cutoff depends on the platform's {{time.mktime}}).

When the table is exported as JSON to a GCS bucket, dates and datetimes are converted to timestamps
using:

{code}
time.mktime(value.timetuple())
{code} 
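
For a date within the range the platform supports, this produces the expected epoch value; for example (illustrative output, assuming a machine whose local timezone is UTC, since {{time.mktime}} interprets the tuple as local time):

{code}
>>> import datetime, time
>>> time.mktime(datetime.date(2017, 9, 21).timetuple())
1505952000.0
{code}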

This creates a problem when you try to parse a date that can't be converted to a UNIX timestamp.
For example:

{code}
>>> value = datetime.date(1850,1,1)
>>> time.mktime(value.timetuple())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: year out of range
{code}
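
For comparison, {{calendar.timegm}} converts the same timetuple without raising, because it does the calendar arithmetic itself instead of deferring to the platform's C library (the value below is simply the corresponding negative UTC epoch offset):

{code}
>>> import calendar, datetime
>>> calendar.timegm(datetime.date(1850, 1, 1).timetuple())
-3786825600
{code}

Note that {{calendar.timegm}} interprets the tuple as UTC while {{time.mktime}} uses local time, so switching would also pin the exported timestamps to UTC.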

*Steps to reproduce*
0. Set up a MySQL connection and a GCP connection in Airflow.
1. Create a MySQL table with a "date" field and a "datetime" field, and put some data into it:
{code}
CREATE TABLE table_with_date (
    date_field date,
    datetime_field datetime
);
INSERT INTO table_with_date (date_field, datetime_field) VALUES ('1850-01-01', NOW());
{code}
2. Create a DAG that exports the data from MySQL to GCS and then loads it from GCS into BigQuery, using the schema file written by the extract task. For example:
{code}
extract = MySqlToGoogleCloudStorageOperator(
        task_id="extract_table",
        mysql_conn_id='mysql_connection',
        google_cloud_storage_conn_id='gcp_connection',
        sql="SELECT * FROM table_with_date",
        bucket='gcs-bucket',
        filename='table_with_date.json',
        schema_filename='schemas/table_with_date.json',
        dag=dag)

load = GoogleCloudStorageToBigQueryOperator(
        task_id="load_table",
        bigquery_conn_id='gcp_connection',
        google_cloud_storage_conn_id='gcp_connection',
        bucket='gcs-bucket',
        destination_project_dataset_table="dataset.table_with_date",
        source_objects=['table_with_date.json'],
        schema_object='schemas/table_with_date.json',
        source_format='NEWLINE_DELIMITED_JSON',
        create_disposition='CREATE_IF_NEEDED',
        write_disposition='WRITE_TRUNCATE',
        dag=dag)

load.set_upstream(extract)
{code}
3. Run the DAG 

Expected: The DAG runs successfully.
Actual: The {{extract_table}} task fails with the following error:
{code}
...
ERROR - year out of range
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/airflow/models.py", line 1374, in run
    result = task_copy.execute(context=context)
  File "/usr/lib/python2.7/site-packages/airflow/contrib/operators/mysql_to_gcs.py", line 91, in execute
    files_to_upload = self._write_local_data_files(cursor)
  File "/usr/lib/python2.7/site-packages/airflow/contrib/operators/mysql_to_gcs.py", line 132, in _write_local_data_files
    row = map(self.convert_types, row)
  File "/usr/lib/python2.7/site-packages/airflow/contrib/operators/mysql_to_gcs.py", line 196, in convert_types
    return time.mktime(value.timetuple())
ValueError: year out of range
...
{code}
*Comments:*
This is really a limitation of {{time.mktime}}, which delegates to the platform's C library and raises {{ValueError: year out of range}} for dates outside a platform-dependent window (here, somewhere before ~1850). BigQuery's TIMESTAMP type, on the other hand, accepts years all the way back to 0001. Per https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#timestamp-type, the TIMESTAMP range is:
{quote}
0001-01-01 00:00:00 to 9999-12-31 23:59:59.999999 UTC.
{quote}
I think the fix is probably to keep converting date/datetime values to timestamps, but to use {{calendar.timegm}} instead of {{time.mktime}}.
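
A minimal sketch of what that change could look like (illustrative only, not the actual operator code; the real {{convert_types}} method also handles other MySQL types):

{code}
import calendar
import datetime

def convert_types(value):
    """Illustrative stand-in for MySqlToGoogleCloudStorageOperator.convert_types.

    Dates and datetimes become UTC epoch seconds via calendar.timegm, which is
    not limited by the platform C library the way time.mktime is.
    """
    if isinstance(value, (datetime.datetime, datetime.date)):
        return calendar.timegm(value.timetuple())
    return value
{code}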


> MySQL to GCS fails for date/datetime before ~1850
> -------------------------------------------------
>
>                 Key: AIRFLOW-1632
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-1632
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: gcp
>         Environment: Google Cloud Platform
>            Reporter: Michael Ghen
>            Assignee: Michael Ghen
>            Priority: Minor
>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
