tajo-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Hyunsik Choi <hyun...@apache.org>
Subject Re: Timezone issues
Date Tue, 25 Nov 2014 01:52:09 GMT
Thank you all guys for your comments.

Jaewoong,

I leave inline comments. If my answers are not enough for your
question or I misunderstood your question, please feel free to ask
additional questions.

Best regards,
Hyunsik

On Fri, Nov 21, 2014 at 12:50 AM, Jaewoong Jung <jungjw@gmail.com> wrote:
> There's another issue that hopefully Hyunsik would be able to clarify,
> and it's very crucial to handling timezones in these data types.
>
> Q: So, let's say (and I agree) TimeDatum represents an instant, so can
> be timezoned. Then, is it a UTC time or a local time?
>

TimeDatum is UTC or should be UTC if some parts are not.

> Let me explain why this question is important.
>
> DateDatum represents an instant. And, it is implicitly timezoned to
> the user local time. Which means, if I use '11/20/2014', it's
> '11/20/2014 0:00:00 PST' and is equivalent to a TimestampDatum for
> '11/20/2014 8:00:00 UTC'. (BTW, the compareTo implementation for
> DateDatum is broken in this regard. I'll file a separate issue for
> that.)

I agree with your proposal. I also think that DateDatum does not need
to be timezoned. We should keep it as DateDatum instead of
TimestampDatum. Thank you for nice finding!

> What about TimeDatum? It is currently timezoned to UTC, (surprise,
> anyone?) if I understood the code correctly. When we add a TimeDatum
> to a DateDatum, we convert the TimeDatum to the user local time, which
> implies TimeDatum is UTC-based. (Also, the comment next to the line
> explicitly mentions it.)

You are right. TimeDatum represents UTC time.

FYI, I'd like to describe additional background. There are only two
entry points to take time or timestamp values. One is records in input
tables, and another is SQL statements. Currently, Input table uses the
system global timezone specified in TajoConf (tajo-site.xml file).
Later, we will add one table property to allow users to specify
timezone for each table. For SQL statement (e.g., SELECT time
'03:00:00'), we will use client timezone. Also, we will provide some
expression to allow users to specify timezone for time or timestamp in
SQL statements.

Consequently, only two entry points have to deal with timezone for
Timestamp and Time. Other parts in Tajo should deal with all values in
UTC.

>
> So, here's the problem.
>
> Why do they have different timezones? They're incomplete as an instant
> when used alone and are complementary to each other. This is an
> important concept. To understand it, you have to think about why
> adding DateDatum and TimeDatum is allowed in the code. Originally,
> instants can't be added. (And, that's why I thought TimeDatum is not
> an instant.) You can't add (say) 11/20/2014 to 11/23/2019. Subtracting
> an instant from an instant makes sense and results in a period, but
> they can't be added.
>
> However, in the case of DateDatum and TimeDatum, additions are allowed
> because they're complementary to each other, and what the code does
> **conceptually** is concatenate the two.

Great insight! So far, I haven't thought it.

> Therefore, because they're intended to be used together, I'd argue
> they shouldn't have different timezones. Also, if they have different
> timezones, additions can't have a simple correct answer. What's the
> correct answer of 11/20/2014 (PST) + 8:00:00 (UTC)? There's no clear
> answer because they can't be simply concatenated.
>
> (FWIW, the current Tajo code thinks the answer is 11/20/2014 8:00:00
> in UTC. How many of you got it right?)

You are definitely right :) If they have different timezone, the
problem becomes very complicated. Nobody wants it :) As I mentioned,
Timezone problem of Timestamp and Time data types should be addressed
in two entry points and client. We need to keep the processing
approach simple.

> This can cause a lot of confusion to users. When they use a date
> alone, it is interpreted as a local time. But, as soon as they add a
> time to it, it is silently converted to UTC in a way which is very
> unexpected to many users.
>
> Why am I emphasizing it is unexpected? Look at the comparison below.
>
> What's the answer to 11/20/2014 (DateDatum) + 8 hours (IntervalDatum)?
> It's 11/20/2014 8:00 in a local time (PST on my machine). How about
> 11/20/2014 (DateDatum) + 8:00:00 (TimeDatum)? It's 11/20/2014 8:00 in
> UTC as I wrote above. How many users would be able to expect this?

They are definitely a bug. We follow PostgreSQL in all aspects, and
the following results come from the PostgreSQL. The results of two
operations are the same.

hyunsik=> SELECT date '11/20/2014' + time '08:00';
      ?column?
---------------------
 2014-11-20 08:00:00
(1 row)

hyunsik=> select date '11/20/2014' + interval '8 hrs';
      ?column?
---------------------
 2014-11-20 08:00:00
(1 row)


>
> So, coming back to my original question. What timezone should a
> TimeDatum have? UTC or local time? It's currently UTC. But, I believe
> it should be changed to the local time zone.

TimeDatum is UTC.

In sum, we should keep both TimeDatum and TimestampDatum UTC values.
Then, we should address timezone offsets in two entry points and
client side.

>
> (Sorry for the long email. But, I think it's critical to get this
> right and build consensus over it so that we can provide a consistent
> behavior going forward. The actual fix will be really simple, though.)
>
> On Thu, Nov 20, 2014 at 2:44 AM, Hyunsik Choi <hyunsik.choi@gmail.com> wrote:
>> Those parts have poor documentation.
>>
>> I agree with your investigation. I also could find many misuse of timezone
>> in many parts. We should make them clear and fix them in this chance.
>>
>> I just got off the plane, and I'm still on the road. So, I'll give more
>> comments tomorrow.
>>
>> - hyunsik
>>
>> On Thu, Nov 20, 2014 at 5:24 PM, Jaewoong Jung <jungjw@gmail.com> wrote:
>>
>>> Wow, this seemingly trivial issue has surprisingly many problems involved.
>>>
>>> The most critical one, though, is TimeMeta class. Presumably because
>>> it is poorly documented, it is being used for two different purposes
>>> in Tajo code base. Some code treats it as a date time representation,
>>> which I believe is the original intention, but some treat it as the
>>> human-readable equivalent of TimeDatum by completely ignoring
>>> date-related fields.
>>>
>>> For example, DateTimeUtil.date2j(long julianDate, TimeMeta tm), which
>>> converts a julian timestamp to a TimeMeta value, doesn't touch
>>> dayOfMonth, monthOfYear, or years values and just puts all values for
>>> hour and above units in hours field.
>>>
>>> There are other minor problems like incorrect comments and absent
>>> default values, but the most critical one is misuse of TimeMeta.
>>>
>>> I'll try to break up my patches so that each has a clear and
>>> easy-to-understand goal. Sending this heads-up to let you know
>>> there'll be more issues filed and patches sent than you might expect.
>>>
>>> On Wed, Nov 19, 2014 at 10:56 PM, Jaewoong Jung <jungjw@gmail.com> wrote:
>>> > Yeah, after some more research, I found that TImeDatum is a somewhat
>>> > ambiguous data type.
>>> >
>>> > Its original purpose is to represent a time of a day, i.e. hh:mm:ss
>>> > part of an instant. So, it could be viewed as instant data though some
>>> > may argue it's incomplete to fully represent an instant. Anyway, given
>>> > that TimestampDatum has limitation in terms of the time range it can
>>> > represent, and given that Tajo doesn't have DateTime data type,
>>> > clients should be allowed to use it with a timezone.
>>> >
>>> > I'll change the direction and try to address the issue by fixing the
>>> > underflow error.
>>> >
>>> > Thanks for the input. :)
>>> >
>>> > On Wed, Nov 19, 2014 at 10:24 PM, Jong-young Park <eminency@gmail.com>
>>> wrote:
>>> >> Hi, Jaewoong.
>>> >>
>>> >> To express time period value, IntervalDatum is existing as I know.
>>> >>
>>> >> So I think it is right that TimeDatum is for some time point.
>>> >>
>>> >> And TimestampDatum seems it is doing both roles of DateDatum and
>>> TimeDatum.
>>> >>
>>> >> Regards,
>>> >> Jongyoung.
>>> >>
>>> >> On Wed Nov 19 2014 at 오후 5:23:40 Jaewoong Jung <jungjw@gmail.com>
>>> wrote:
>>> >>
>>> >>> It turns out, TAJO-1191 is slightly more complicated than I thought.
>>> >>> (https://issues.apache.org/jira/browse/TAJO-1191)
>>> >>>
>>> >>> Basically, it's about whether TimeDatum may have a timezone tied
with
>>> >>> it. I **believe** TimeDatum is originally designed to hold a time
>>> >>> period value, not an instant. (TimestampDatum seems to be the
>>> >>> canonical container for instants.) So, it doesn't make sense to
apply
>>> >>> any timezones to TimeDatum values, but it's being done in a few
>>> >>> places. And, that's why the test is failing on my machines.
>>> >>>
>>> >>> I'm going to try to fix it by removing all timezone-related references
>>> >>> around the class, but I want to check my assumption with you before
I
>>> >>> proceed.
>>> >>>
>>> >>> What do you think about it?
>>> >>>
>>>

Mime
View raw message