airflow-dev mailing list archives

From Ash Berlin-Taylor <ash_airflowl...@firemirror.com>
Subject Re: [VOTE] Airflow 1.10.0rc3
Date Sun, 05 Aug 2018 20:43:59 GMT
Entirely possible, though I wasn't even dealing with the scheduler - the issue I was addressing
was entirely in the webserver for a pre-existing Task Instance.

Ah, I hadn't noticed/twigged that we are using sqlalchemy-utc. It appears it isn't working
as expected. This line: https://github.com/spoqa/sqlalchemy-utc/blob/master/sqlalchemy_utc/sqltypes.py#L34
doesn't look right for us - as you mentioned, the TZ is already set to something (rather than
having no TZ value), so that branch never runs and the value comes back unconverted.
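
For reference, a minimal sketch (my own code, not the library's) of the two cases as I
understand them:

import datetime

from sqlalchemy import types


class UtcDateTime(types.TypeDecorator):
    """Sketch of a tz-aware DateTime that normalises results to UTC."""

    impl = types.DateTime(timezone=True)

    def process_result_value(self, value, dialect):
        if value is None:
            return None
        if value.tzinfo is None:
            # What the linked line does: assume naive values are UTC.
            return value.replace(tzinfo=datetime.timezone.utc)
        # The missing case: an aware value (e.g. psycopg2's
        # FixedOffsetTimezone at +60) should be converted, not passed through.
        return value.astimezone(datetime.timezone.utc)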

Some background on how Postgres handles TZs: it always returns datetimes in the TZ of the
connection. I'm not sure if this is unique to Postgres or if other DBs behave the same.

postgres=# select '2018-08-03 00:00:00+00:00'::timestamp with time zone;
      timestamptz
------------------------
 2018-08-03 01:00:00+01

postgres=# select '2018-08-03 02:00:00+02'::timestamp with time zone;
      timestamptz
------------------------
 2018-08-03 01:00:00+01

The server will always return TZs in the connection timezone.

postgres=# set timezone=utc;
SET
postgres=# select '2018-08-03 02:00:00+02'::timestamp with time zone;
      timestamptz
------------------------
 2018-08-03 00:00:00+00
(1 row)

postgres=# select '2018-08-03 01:00:00+01'::timestamp with time zone;
      timestamptz
------------------------
 2018-08-03 00:00:00+00
(1 row)
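
The same thing is visible from Python via psycopg2 - a rough sketch (the DSN is a placeholder):

import psycopg2

conn = psycopg2.connect("dbname=airflow")  # placeholder connection
cur = conn.cursor()
cur.execute("select '2018-08-03 00:00:00+00'::timestamptz")
value = cur.fetchone()[0]
# With the connection timezone left at the server default (GB here), tzinfo
# comes back as psycopg2.tz.FixedOffsetTimezone(offset=60, name=None), not UTC.
print(value, value.tzinfo)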




-ash

> On 5 Aug 2018, at 21:28, Bolke de Bruin <bdbruin@gmail.com> wrote:
> 
> This is the issue:
> 
> [2018-08-05 22:08:21,952] {jobs.py:906} INFO - NEXT RUN DATE: 2018-08-03 00:00:00+00:00 tzinfo: <Timezone [UTC]>
> [2018-08-05 22:08:22,007] {jobs.py:1425} INFO - Created <DagRun example_http_operator @ 2018-08-03 02:00:00+02:00: scheduled__2018-08-03T00:00:00+00:00, externally triggered: False>
> 
> [2018-08-05 22:08:24,651] {jobs.py:906} INFO - NEXT RUN DATE: 2018-08-04 02:00:00+02:00 tzinfo: psycopg2.tz.FixedOffsetTimezone(offset=120, name=None)
> [2018-08-05 22:08:24,696] {jobs.py:1425} INFO - Created <DagRun example_http_operator @ 2018-08-04 02:00:00+02:00: scheduled__2018-08-04T02:00:00+02:00, externally triggered: False>
> 
> Notice in lines 1+2 that the next run date is correctly in UTC, but from the DB it gets
> a +2. In the next bit (lines 3+4) we get a psycopg2.tz.FixedOffsetTimezone, which should be
> set to UTC according to the specs of https://github.com/spoqa/sqlalchemy-utc, but it isn’t.
> 
> So changing your setting of the DB to UTC fixes the symptom but not the cause.
> 
> B.
> 
>> On 5 Aug 2018, at 22:03, Ash Berlin-Taylor <ash_airflowlist@firemirror.com> wrote:
>> 
>> Sorry for being terse before.
>> 
>> So the issue is that the ts loaded from the DB is not in UTC, it's in GB/+01 (the
>> default of the DB server).
>> 
>> For me, on a currently running 1.9 (no TZ) db:
>> 
>> airflow=# select * from task_instance;
>> get_op            | example_http_operator | 2018-07-23 00:00:00
>> 
>> This datetime appears in the log URL, and the path it looks at on S3 is
>> 
>> .../example_http_operator/2018-07-23T00:00:00/1.log
>> 
>> If my postgres server has a default timezone of GB (which the one running on my laptop
>> does), and I then apply the migration, then it is converted to that local time.
>> 
>> airflow=# select * from task_instance;
>> get_op            | example_http_operator | 2018-07-23 01:00:00+01
>> 
>> airflow=# set timezone=UTC;
>> airflow=# select * from task_instance;
>> get_op            | example_http_operator | 2018-07-23 00:00:00+00
>> 
>> 
>> This is all okay so far. The migration has kept the column at the same moment in time.
>> 
>> The issue comes when the UI tries to display logs for this old task: because the timezone
>> of the connection is not UTC, PG returns a date with a +01 TZ. Thus after the migration this
>> old task tries to look for a log file of
>> 
>> .../example_http_operator/2018-07-23T01:00:00/1.log
>> 
>> which doesn't exist - it's changed the time it has rendered from midnight (in v1.9)
>> to 1am (in v1.10).
>> 
>> (This is with my change to log_filename_template from UPDATING.md in my other branch)
>> 
>> Setting the timezone to UTC per connection means the behaviour of Airflow doesn't
>> change depending on how the server is configured.
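>> 
>> A sketch of one way to do that (not necessarily what my branch does; names are mine):
>> 
>> from sqlalchemy import create_engine, event
>> 
>> engine = create_engine('postgresql:///airflow')  # placeholder URL
>> 
>> @event.listens_for(engine, 'connect')
>> def set_utc_timezone(dbapi_connection, connection_record):
>>     # Pin every pooled connection to UTC so results no longer depend on
>>     # the Postgres server's default timezone.
>>     cursor = dbapi_connection.cursor()
>>     cursor.execute("SET timezone TO 'UTC'")
>>     cursor.close()
>> 
>> (psycopg2 also accepts connect_args={'options': '-c timezone=utc'} on the engine.)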
>> 
>> -ash
>> 
>>> On 5 Aug 2018, at 20:58, Bolke de Bruin <bdbruin@gmail.com> wrote:
>>> 
>>> Digging in a bit further. 
>>> 
>>> {{ ti.dag_id }}/{{ ti.task_id }}/{{ ts }}/{{ try_number }}.log
>>> 
>>> is the format
>>> 
>>> ts = execution_date.isoformat() and should be in UTC afaik.
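>>> 
>>> To see why that matters, plain Python (values made up):
>>> 
>>> from datetime import datetime, timezone, timedelta
>>> 
>>> utc = datetime(2018, 7, 23, tzinfo=timezone.utc)
>>> gb = utc.astimezone(timezone(timedelta(hours=1)))
>>> print(utc.isoformat())  # 2018-07-23T00:00:00+00:00
>>> print(gb.isoformat())   # 2018-07-23T01:00:00+01:00
>>> 
>>> Same instant, two different log paths.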
>>> 
>>> something is weird tbh.
>>> 
>>> B.
>>> 
>>> 
>>>> On 5 Aug 2018, at 21:32, Bolke de Bruin <bdbruin@gmail.com> wrote:
>>>> 
>>>> Ash,
>>>> 
>>>> Reading your proposed changes on your “set-timezone-to-utc” branch and
>>>> below analysis, I am not sure what you are perceiving as an issue.
>>>> 
>>>> For conversion we assume everything is stored in UTC and in a naive format.
>>>> Conversion then adds the timezone information. This results in the following:
>>>> 
>>>> postgres timezone = “Europe/Amsterdam”
>>>> 
>>>> 
>>>> airflow=# select * from task_instance;
>>>> get_op            | example_http_operator | 2018-07-27 02:00:00+02
>>>> 
>>>> airflow=# set timezone=UTC;
>>>> airflow=# select * from task_instance;
>>>> get_op            | example_http_operator | 2018-07-27 00:00:00+00
>>>> 
>>>> If we don’t set the timezone in the connection, postgres assumes the server
>>>> timezone (in my case “Europe/Amsterdam”). So every datetime Airflow receives will be in
>>>> “Europe/Amsterdam”. However, as we defined the model to use UTCDateTime, it will
>>>> always convert the returned DateTime to UTC.
>>>> 
>>>> If we have configured Airflow to use something other than UTC as the default
>>>> timezone, or a DAG has an associated timezone, we only convert to that timezone when calculating
>>>> the next runtime (not for cron, btw). Nowhere else, and thus we are UTC everywhere.
>>>> 
>>>> What do you think is inconsistent?
>>>> 
>>>> Bolke
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>>> On 5 Aug 2018, at 18:13, Ash Berlin-Taylor <ash_airflowlist@firemirror.com> wrote:
>>>>> 
>>>>> Relating to 2): I'm not sure that the upgrade from timezoneless to timezone-aware
>>>>> columns in the task instance is right, or at least it's not what I expected.
>>>>> 
>>>>> Before, weren't all TZs from schedule dates etc in UTC? For the same task
>>>>> instance (these outputs from psql directly):
>>>>> 
>>>>> before: execution_date=2017-09-04 00:00:00
>>>>> after: execution_date=2017-09-04 01:00:00+01
>>>>> 
>>>>> **Okay, the migration is fine.** It appears that the migration has done
>>>>> the right thing, but the local DB I'm testing with has a timezone of GB set, so Postgres
>>>>> converts it to that TZ on returning an object.
>>>>> 
>>>>> 3) Do we need to set the TZ of the connection to UTC in SQLAlchemy to
>>>>> have consistent behaviour? Is this possible somehow? I don't know SQLAlchemy that well.
>>>>> 
>>>>> 
>>>>> -ash
>>>>> 
>>>>>> On 5 Aug 2018, at 16:01, Ash Berlin-Taylor <ash_airflowlist@firemirror.com> wrote:
>>>>>> 
>>>>>> 1) Missing UPDATING note about the change of task_log_reader to now
>>>>>> always being "task" (it was "s3.task" before). Logging config is much simpler now, though.
>>>>>> This may be particular to my logging config, but given how much of a pain it was to set up
>>>>>> S3 logging in 1.9, I have shared my config with some people in the Gitter chat, so it's not
>>>>>> just me.
>>>>>> 
>>>>>> 2) The path that log files are written to in S3 has changed (again
>>>>>> - this happened from 1.8 to 1.9). I'd like to avoid having to move all of my log files again
>>>>>> to continue viewing them. The change is that the path now (in 1.10) has a timezone in it,
>>>>>> and the date is in local time, where before it was UTC:
>>>>>> 
>>>>>> before: 2018-07-23T00:00:00/1.log
>>>>>> after: 2018-07-23T01:00:00+01:00/1.log
>>>>>> 
>>>>>> We can possibly get away with an updating note about this to set
>>>>>> a custom log_filename_template. Testing this now.
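>>>>>> 
>>>>>> Something along these lines in airflow.cfg might restore the old naive paths
>>>>>> (untested - it assumes execution_date renders as a UTC datetime in the template
>>>>>> context, which is exactly what's in question here):
>>>>>> 
>>>>>> log_filename_template = {{ ti.dag_id }}/{{ ti.task_id }}/{{ execution_date.strftime('%Y-%m-%dT%H:%M:%S') }}/{{ try_number }}.log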
>>>>>> 
>>>>>>> On 5 Aug 2018, at 15:00, Ash Berlin-Taylor <ash@firemirror.com> wrote:
>>>>>>> 
>>>>>>> -1(binding) from me.
>>>>>>> 
>>>>>>> Installed with:
>>>>>>> 
>>>>>>> AIRFLOW_GPL_UNIDECODE=yes pip install 'https://dist.apache.org/repos/dist/dev/incubator/airflow/1.10.0rc3/apache-airflow-1.10.0rc3+incubating-bin.tar.gz#egg=apache-airflow[emr,s3,crypto]>=1.10'
>>>>>>> 
>>>>>>> Install went fine.
>>>>>>> 
>>>>>>> Our DAGs that use SparkSubmitOperator are now failing, as there is a hard
>>>>>>> dependency on the Kubernetes client libs, but the `emr` group doesn't mention this.
>>>>>>> 
>>>>>>> Introduced in https://github.com/apache/incubator-airflow/pull/3112
>>>>>>> 
>>>>>>> I see two options for this: either conditionally enable k8s:// support
>>>>>>> if the import works, or (less preferred) add the kube client to the emr deps.
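>>>>>>> 
>>>>>>> By "conditionally" I mean the usual optional-dependency pattern, roughly:
>>>>>>> 
>>>>>>> try:
>>>>>>>     # Optional dep: only needed when a spark master URL uses k8s://
>>>>>>>     import kubernetes  # noqa: F401
>>>>>>>     HAS_KUBERNETES = True
>>>>>>> except ImportError:
>>>>>>>     HAS_KUBERNETES = False
>>>>>>> 
>>>>>>> and then raising a clear error only when a DAG actually asks for k8s://.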
>>>>>>> 
>>>>>>> Sorry - this is the first time I've been able to test it.
>>>>>>> 
>>>>>>> I will install this dep manually and continue testing.
>>>>>>> 
>>>>>>> -ash
>>>>>>> 
>>>>>>> (Normally no time at home due to new baby, but I got a standing
>>>>>>> desk and a carrier, meaning she can sleep on me and I can use my laptop. Win!)
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> On 4 Aug 2018, at 22:32, Bolke de Bruin <bdbruin@gmail.com> wrote:
>>>>>>>> 
>>>>>>>> Bump. 
>>>>>>>> 
>>>>>>>> Committers please cast your vote. 
>>>>>>>> 
>>>>>>>> B.
>>>>>>>> 
>>>>>>>> Sent from my iPhone
>>>>>>>> 
>>>>>>>>> On 3 Aug 2018, at 13:23, Driesprong, Fokko <fokko@driesprong.frl> wrote:
>>>>>>>>> 
>>>>>>>>> +1 Binding
>>>>>>>>> 
>>>>>>>>> Installed it using: SLUGIFY_USES_TEXT_UNIDECODE=yes pip install
>>>>>>>>> https://dist.apache.org/repos/dist/dev/incubator/airflow/1.10.0rc3/apache-airflow-1.10.0rc3+incubating-bin.tar.gz
>>>>>>>>> 
>>>>>>>>> Cheers, Fokko
>>>>>>>>> 
>>>>>>>>> 2018-08-03 9:47 GMT+02:00 Bolke de Bruin <bdbruin@gmail.com>:
>>>>>>>>> 
>>>>>>>>>> Hey all,
>>>>>>>>>> 
>>>>>>>>>> I have cut Airflow 1.10.0 RC3. This email is calling a vote on the release,
>>>>>>>>>> which will last for 72 hours. Consider this my (binding) +1.
>>>>>>>>>> 
>>>>>>>>>> Airflow 1.10.0 RC3 is available at:
>>>>>>>>>> 
>>>>>>>>>> https://dist.apache.org/repos/dist/dev/incubator/airflow/1.10.0rc3/
>>>>>>>>>> 
>>>>>>>>>> apache-airflow-1.10.0rc3+incubating-source.tar.gz is a source release that
>>>>>>>>>> comes with INSTALL instructions.
>>>>>>>>>> apache-airflow-1.10.0rc3+incubating-bin.tar.gz is the binary Python "sdist"
>>>>>>>>>> release.
>>>>>>>>>> 
>>>>>>>>>> Public keys are available at:
>>>>>>>>>> 
>>>>>>>>>> https://dist.apache.org/repos/dist/release/incubator/airflow/
>>>>>>>>>> 
>>>>>>>>>> The number of JIRAs fixed is over 700. Please have a look at the changelog.
>>>>>>>>>> Since RC2 the following has been fixed:
>>>>>>>>>> 
>>>>>>>>>> * [AIRFLOW-2817] Force explicit choice on GPL dependency
>>>>>>>>>> * [AIRFLOW-2716] Replace async and await py3.7 keywords
>>>>>>>>>> * [AIRFLOW-2810] Fix typo in Xcom model timestamp
>>>>>>>>>> 
>>>>>>>>>> Please note that the version number excludes the `rcX` string as well
>>>>>>>>>> as the "+incubating" string, so it's now simply 1.10.0. This will allow us
>>>>>>>>>> to rename the artifact without modifying the artifact checksums when we
>>>>>>>>>> actually release.
>>>>>>>>>> 
>>>>>>>>>> WARNING: Due to licensing requirements you will need to set
>>>>>>>>>> SLUGIFY_USES_TEXT_UNIDECODE=yes in your environment when
>>>>>>>>>> installing or upgrading. We will try to remove this requirement for the
>>>>>>>>>> next release.
>>>>>>>>>> 
>>>>>>>>>> Cheers,
>>>>>>>>>> Bolke
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 

