tajo-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Hyunsik Choi <hyun...@apache.org>
Subject Re: Timezone issues
Date Sun, 23 Nov 2014 14:30:18 GMT
(Actually, I had suffered from some mailing list rejection problem for
few days. This problem was caused by the change of committer shell
access policy. Finally, I fixed the problem. I'm not sure if you
received the following mail. So, I send it again.)

Hi Jaewoong,

Today, I investigated the problem in more deep. I found the problem
about misuse of timezone. Recently, I've also suffered from this
problem while I'm working tajo-client. Actually, the recap of datetime
types were contributed by one contributor, and I reviewed the codes.
Since I overlooked timezone-related parts, it's my fault consequently.
Anyway, we can fix it. I'm very happy for us to fix this problem with
your help.

The main cause of this problem is that Datum-level data structures
deal with timezone directly.

First of all, in this email, I'm going to give the background in more
detail; I think that this background includes many answers of your
questions. Then, I'll give the answers of your questions in next
email.

As you may know, In SQL, there are two kinds of data types: one is
without tiemzone and another is without timezone.

SQL standard types without timezone:
* Timestamp
* Time

SQL standard types with timezone:
* Timestampz
* Timez

Each timestampz or Timez value has timezone offset itself. Both types
are straightforward. Also, Date type does not have 'with timezone'
type.

BTW, 'types without timezone' (e.g., Timestamp and Time) can be
affected by an external timezone in almost all SQL systems that I've
seen. In other words, their timezone can be present in outside of data
values. By default, many systems use the timezone of local machine.
Similarly, Tajo uses a static variable 'TajoConf.CURRENT_TIMEZONE'.

The main issue is when data values are adjusted by timezone. The
adjustment should be performed before Datum is created. In other
words,  in scanner, Timestamp or Time value should be adjusted by
timezone. TimestampDatum or TimeDatum must be with UTC+0. But, in the
current implementation, some parts of TimestampDatum and TimeDatum
address directly timezone. It's wrong in my opinion. Also, all unit
tests for TimestampDatum and TimeDatum should be performed with UTC.

In addition, timestamp or time values should be adjusted one more by
'user timezone' when an user gets the query results. In sum, the flow
is as follows:

Timestamp or Time value in an Input table ---(adjust by an external
timezone or local timezone)---> TimestampDatum or TimeDatum with UTC
---> query processing  ---(adjust by an user timezone)-----> Print

The reason why TimestampDatum and TimeDaum values must be addressed in
UTC is the simplicity of processing codes. In many cases, two or more
tables used in one query can have multiple different timezones. If all
values are in UTC, the processing does not need to consider different
timezones. But, the current implementation has one global timezone. We
should fix it, and I created the jira issue for it at
https://issues.apache.org/jira/browse/TAJO-1186.

Best regards,
Hyunsik

On Fri, Nov 21, 2014 at 12:50 AM, Jaewoong Jung <jungjw@gmail.com> wrote:
> There's another issue that hopefully Hyunsik would be able to clarify,
> and it's very crucial to handling timezones in these data types.
>
> Q: So, let's say (and I agree) TimeDatum represents an instant, so can
> be timezoned. Then, is it a UTC time or a local time?
>
> Let me explain why this question is important.
>
> DateDatum represents an instant. And, it is implicitly timezoned to
> the user local time. Which means, if I use '11/20/2014', it's
> '11/20/2014 0:00:00 PST' and is equivalent to a TimestampDatum for
> '11/20/2014 8:00:00 UTC'. (BTW, the compareTo implementation for
> DateDatum is broken in this regard. I'll file a separate issue for
> that.)
>
> What about TimeDatum? It is currently timezoned to UTC, (surprise,
> anyone?) if I understood the code correctly. When we add a TimeDatum
> to a DateDatum, we convert the TimeDatum to the user local time, which
> implies TimeDatum is UTC-based. (Also, the comment next to the line
> explicitly mentions it.)
>
> So, here's the problem.
>
> Why do they have different timezones? They're incomplete as an instant
> when used alone and are complementary to each other. This is an
> important concept. To understand it, you have to think about why
> adding DateDatum and TimeDatum is allowed in the code. Originally,
> instants can't be added. (And, that's why I thought TimeDatum is not
> an instant.) You can't add (say) 11/20/2014 to 11/23/2019. Subtracting
> an instant from an instant makes sense and results in a period, but
> they can't be added.
>
> However, in the case of DateDatum and TimeDatum, additions are allowed
> because they're complementary to each other, and what the code does
> **conceptually** is concatenate the two.
>
> Therefore, because they're intended to be used together, I'd argue
> they shouldn't have different timezones. Also, if they have different
> timezones, additions can't have a simple correct answer. What's the
> correct answer of 11/20/2014 (PST) + 8:00:00 (UTC)? There's no clear
> answer because they can't be simply concatenated.
>
> (FWIW, the current Tajo code thinks the answer is 11/20/2014 8:00:00
> in UTC. How many of you got it right?)
>
> This can cause a lot of confusion to users. When they use a date
> alone, it is interpreted as a local time. But, as soon as they add a
> time to it, it is silently converted to UTC in a way which is very
> unexpected to many users.
>
> Why am I emphasizing it is unexpected? Look at the comparison below.
>
> What's the answer to 11/20/2014 (DateDatum) + 8 hours (IntervalDatum)?
> It's 11/20/2014 8:00 in a local time (PST on my machine). How about
> 11/20/2014 (DateDatum) + 8:00:00 (TimeDatum)? It's 11/20/2014 8:00 in
> UTC as I wrote above. How many users would be able to expect this?
>
> So, coming back to my original question. What timezone should a
> TimeDatum have? UTC or local time? It's currently UTC. But, I believe
> it should be changed to the local time zone.
>
> (Sorry for the long email. But, I think it's critical to get this
> right and build consensus over it so that we can provide a consistent
> behavior going forward. The actual fix will be really simple, though.)
>
> On Thu, Nov 20, 2014 at 2:44 AM, Hyunsik Choi <hyunsik.choi@gmail.com> wrote:
>> Those parts have poor documentation.
>>
>> I agree with your investigation. I also could find many misuse of timezone
>> in many parts. We should make them clear and fix them in this chance.
>>
>> I just got off the plane, and I'm still on the road. So, I'll give more
>> comments tomorrow.
>>
>> - hyunsik
>>
>> On Thu, Nov 20, 2014 at 5:24 PM, Jaewoong Jung <jungjw@gmail.com> wrote:
>>
>>> Wow, this seemingly trivial issue has surprisingly many problems involved.
>>>
>>> The most critical one, though, is TimeMeta class. Presumably because
>>> it is poorly documented, it is being used for two different purposes
>>> in Tajo code base. Some code treats it as a date time representation,
>>> which I believe is the original intention, but some treat it as the
>>> human-readable equivalent of TimeDatum by completely ignoring
>>> date-related fields.
>>>
>>> For example, DateTimeUtil.date2j(long julianDate, TimeMeta tm), which
>>> converts a julian timestamp to a TimeMeta value, doesn't touch
>>> dayOfMonth, monthOfYear, or years values and just puts all values for
>>> hour and above units in hours field.
>>>
>>> There are other minor problems like incorrect comments and absent
>>> default values, but the most critical one is misuse of TimeMeta.
>>>
>>> I'll try to break up my patches so that each has a clear and
>>> easy-to-understand goal. Sending this heads-up to let you know
>>> there'll be more issues filed and patches sent than you might expect.
>>>
>>> On Wed, Nov 19, 2014 at 10:56 PM, Jaewoong Jung <jungjw@gmail.com> wrote:
>>> > Yeah, after some more research, I found that TImeDatum is a somewhat
>>> > ambiguous data type.
>>> >
>>> > Its original purpose is to represent a time of a day, i.e. hh:mm:ss
>>> > part of an instant. So, it could be viewed as instant data though some
>>> > may argue it's incomplete to fully represent an instant. Anyway, given
>>> > that TimestampDatum has limitation in terms of the time range it can
>>> > represent, and given that Tajo doesn't have DateTime data type,
>>> > clients should be allowed to use it with a timezone.
>>> >
>>> > I'll change the direction and try to address the issue by fixing the
>>> > underflow error.
>>> >
>>> > Thanks for the input. :)
>>> >
>>> > On Wed, Nov 19, 2014 at 10:24 PM, Jong-young Park <eminency@gmail.com>
>>> wrote:
>>> >> Hi, Jaewoong.
>>> >>
>>> >> To express time period value, IntervalDatum is existing as I know.
>>> >>
>>> >> So I think it is right that TimeDatum is for some time point.
>>> >>
>>> >> And TimestampDatum seems it is doing both roles of DateDatum and
>>> TimeDatum.
>>> >>
>>> >> Regards,
>>> >> Jongyoung.
>>> >>
>>> >> On Wed Nov 19 2014 at 오후 5:23:40 Jaewoong Jung <jungjw@gmail.com>
>>> wrote:
>>> >>
>>> >>> It turns out, TAJO-1191 is slightly more complicated than I thought.
>>> >>> (https://issues.apache.org/jira/browse/TAJO-1191)
>>> >>>
>>> >>> Basically, it's about whether TimeDatum may have a timezone tied
with
>>> >>> it. I **believe** TimeDatum is originally designed to hold a time
>>> >>> period value, not an instant. (TimestampDatum seems to be the
>>> >>> canonical container for instants.) So, it doesn't make sense to
apply
>>> >>> any timezones to TimeDatum values, but it's being done in a few
>>> >>> places. And, that's why the test is failing on my machines.
>>> >>>
>>> >>> I'm going to try to fix it by removing all timezone-related references
>>> >>> around the class, but I want to check my assumption with you before
I
>>> >>> proceed.
>>> >>>
>>> >>> What do you think about it?
>>> >>>
>>>

Mime
View raw message