tajo-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jaewoong Jung <jun...@gmail.com>
Subject Re: Timezone issues
Date Tue, 25 Nov 2014 07:17:14 GMT
Everything is very clear now thanks to your explanation. :)

Okay, then I'll fix the issue by making DateDatum timezone-neutral and
TimeDatum UTC-based. Also, I'll play with PostgreSQL to understand its
timezone model better.

Meanwhile, please feel free to assign timezone-related bugs to me as
you see fit.

On Mon, Nov 24, 2014 at 5:52 PM, Hyunsik Choi <hyunsik@apache.org> wrote:
> Thank you all guys for your comments.
>
> Jaewoong,
>
> I leave inline comments. If my answers are not enough for your
> question or I misunderstood your question, please feel free to ask
> additional questions.
>
> Best regards,
> Hyunsik
>
> On Fri, Nov 21, 2014 at 12:50 AM, Jaewoong Jung <jungjw@gmail.com> wrote:
>> There's another issue that hopefully Hyunsik would be able to clarify,
>> and it's very crucial to handling timezones in these data types.
>>
>> Q: So, let's say (and I agree) TimeDatum represents an instant, so can
>> be timezoned. Then, is it a UTC time or a local time?
>>
>
> TimeDatum is UTC or should be UTC if some parts are not.
>
>> Let me explain why this question is important.
>>
>> DateDatum represents an instant. And, it is implicitly timezoned to
>> the user local time. Which means, if I use '11/20/2014', it's
>> '11/20/2014 0:00:00 PST' and is equivalent to a TimestampDatum for
>> '11/20/2014 8:00:00 UTC'. (BTW, the compareTo implementation for
>> DateDatum is broken in this regard. I'll file a separate issue for
>> that.)
>
> I agree with your proposal. I also think that DateDatum does not need
> to be timezoned. We should keep it as DateDatum instead of
> TimestampDatum. Thank you for nice finding!
>
>> What about TimeDatum? It is currently timezoned to UTC, (surprise,
>> anyone?) if I understood the code correctly. When we add a TimeDatum
>> to a DateDatum, we convert the TimeDatum to the user local time, which
>> implies TimeDatum is UTC-based. (Also, the comment next to the line
>> explicitly mentions it.)
>
> You are right. TimeDatum represents UTC time.
>
> FYI, I'd like to describe additional background. There are only two
> entry points to take time or timestamp values. One is records in input
> tables, and another is SQL statements. Currently, Input table uses the
> system global timezone specified in TajoConf (tajo-site.xml file).
> Later, we will add one table property to allow users to specify
> timezone for each table. For SQL statement (e.g., SELECT time
> '03:00:00'), we will use client timezone. Also, we will provide some
> expression to allow users to specify timezone for time or timestamp in
> SQL statements.
>
> Consequently, only two entry points have to deal with timezone for
> Timestamp and Time. Other parts in Tajo should deal with all values in
> UTC.
>
>>
>> So, here's the problem.
>>
>> Why do they have different timezones? They're incomplete as an instant
>> when used alone and are complementary to each other. This is an
>> important concept. To understand it, you have to think about why
>> adding DateDatum and TimeDatum is allowed in the code. Originally,
>> instants can't be added. (And, that's why I thought TimeDatum is not
>> an instant.) You can't add (say) 11/20/2014 to 11/23/2019. Subtracting
>> an instant from an instant makes sense and results in a period, but
>> they can't be added.
>>
>> However, in the case of DateDatum and TimeDatum, additions are allowed
>> because they're complementary to each other, and what the code does
>> **conceptually** is concatenate the two.
>
> Great insight! So far, I haven't thought it.
>
>> Therefore, because they're intended to be used together, I'd argue
>> they shouldn't have different timezones. Also, if they have different
>> timezones, additions can't have a simple correct answer. What's the
>> correct answer of 11/20/2014 (PST) + 8:00:00 (UTC)? There's no clear
>> answer because they can't be simply concatenated.
>>
>> (FWIW, the current Tajo code thinks the answer is 11/20/2014 8:00:00
>> in UTC. How many of you got it right?)
>
> You are definitely right :) If they have different timezone, the
> problem becomes very complicated. Nobody wants it :) As I mentioned,
> Timezone problem of Timestamp and Time data types should be addressed
> in two entry points and client. We need to keep the processing
> approach simple.
>
>> This can cause a lot of confusion to users. When they use a date
>> alone, it is interpreted as a local time. But, as soon as they add a
>> time to it, it is silently converted to UTC in a way which is very
>> unexpected to many users.
>>
>> Why am I emphasizing it is unexpected? Look at the comparison below.
>>
>> What's the answer to 11/20/2014 (DateDatum) + 8 hours (IntervalDatum)?
>> It's 11/20/2014 8:00 in a local time (PST on my machine). How about
>> 11/20/2014 (DateDatum) + 8:00:00 (TimeDatum)? It's 11/20/2014 8:00 in
>> UTC as I wrote above. How many users would be able to expect this?
>
> They are definitely a bug. We follow PostgreSQL in all aspects, and
> the following results come from the PostgreSQL. The results of two
> operations are the same.
>
> hyunsik=> SELECT date '11/20/2014' + time '08:00';
>       ?column?
> ---------------------
>  2014-11-20 08:00:00
> (1 row)
>
> hyunsik=> select date '11/20/2014' + interval '8 hrs';
>       ?column?
> ---------------------
>  2014-11-20 08:00:00
> (1 row)
>
>
>>
>> So, coming back to my original question. What timezone should a
>> TimeDatum have? UTC or local time? It's currently UTC. But, I believe
>> it should be changed to the local time zone.
>
> TimeDatum is UTC.
>
> In sum, we should keep both TimeDatum and TimestampDatum UTC values.
> Then, we should address timezone offsets in two entry points and
> client side.
>
>>
>> (Sorry for the long email. But, I think it's critical to get this
>> right and build consensus over it so that we can provide a consistent
>> behavior going forward. The actual fix will be really simple, though.)
>>
>> On Thu, Nov 20, 2014 at 2:44 AM, Hyunsik Choi <hyunsik.choi@gmail.com> wrote:
>>> Those parts have poor documentation.
>>>
>>> I agree with your investigation. I also could find many misuse of timezone
>>> in many parts. We should make them clear and fix them in this chance.
>>>
>>> I just got off the plane, and I'm still on the road. So, I'll give more
>>> comments tomorrow.
>>>
>>> - hyunsik
>>>
>>> On Thu, Nov 20, 2014 at 5:24 PM, Jaewoong Jung <jungjw@gmail.com> wrote:
>>>
>>>> Wow, this seemingly trivial issue has surprisingly many problems involved.
>>>>
>>>> The most critical one, though, is TimeMeta class. Presumably because
>>>> it is poorly documented, it is being used for two different purposes
>>>> in Tajo code base. Some code treats it as a date time representation,
>>>> which I believe is the original intention, but some treat it as the
>>>> human-readable equivalent of TimeDatum by completely ignoring
>>>> date-related fields.
>>>>
>>>> For example, DateTimeUtil.date2j(long julianDate, TimeMeta tm), which
>>>> converts a julian timestamp to a TimeMeta value, doesn't touch
>>>> dayOfMonth, monthOfYear, or years values and just puts all values for
>>>> hour and above units in hours field.
>>>>
>>>> There are other minor problems like incorrect comments and absent
>>>> default values, but the most critical one is misuse of TimeMeta.
>>>>
>>>> I'll try to break up my patches so that each has a clear and
>>>> easy-to-understand goal. Sending this heads-up to let you know
>>>> there'll be more issues filed and patches sent than you might expect.
>>>>
>>>> On Wed, Nov 19, 2014 at 10:56 PM, Jaewoong Jung <jungjw@gmail.com>
wrote:
>>>> > Yeah, after some more research, I found that TImeDatum is a somewhat
>>>> > ambiguous data type.
>>>> >
>>>> > Its original purpose is to represent a time of a day, i.e. hh:mm:ss
>>>> > part of an instant. So, it could be viewed as instant data though some
>>>> > may argue it's incomplete to fully represent an instant. Anyway, given
>>>> > that TimestampDatum has limitation in terms of the time range it can
>>>> > represent, and given that Tajo doesn't have DateTime data type,
>>>> > clients should be allowed to use it with a timezone.
>>>> >
>>>> > I'll change the direction and try to address the issue by fixing the
>>>> > underflow error.
>>>> >
>>>> > Thanks for the input. :)
>>>> >
>>>> > On Wed, Nov 19, 2014 at 10:24 PM, Jong-young Park <eminency@gmail.com>
>>>> wrote:
>>>> >> Hi, Jaewoong.
>>>> >>
>>>> >> To express time period value, IntervalDatum is existing as I know.
>>>> >>
>>>> >> So I think it is right that TimeDatum is for some time point.
>>>> >>
>>>> >> And TimestampDatum seems it is doing both roles of DateDatum and
>>>> TimeDatum.
>>>> >>
>>>> >> Regards,
>>>> >> Jongyoung.
>>>> >>
>>>> >> On Wed Nov 19 2014 at 오후 5:23:40 Jaewoong Jung <jungjw@gmail.com>
>>>> wrote:
>>>> >>
>>>> >>> It turns out, TAJO-1191 is slightly more complicated than I
thought.
>>>> >>> (https://issues.apache.org/jira/browse/TAJO-1191)
>>>> >>>
>>>> >>> Basically, it's about whether TimeDatum may have a timezone
tied with
>>>> >>> it. I **believe** TimeDatum is originally designed to hold a
time
>>>> >>> period value, not an instant. (TimestampDatum seems to be the
>>>> >>> canonical container for instants.) So, it doesn't make sense
to apply
>>>> >>> any timezones to TimeDatum values, but it's being done in a
few
>>>> >>> places. And, that's why the test is failing on my machines.
>>>> >>>
>>>> >>> I'm going to try to fix it by removing all timezone-related
references
>>>> >>> around the class, but I want to check my assumption with you
before I
>>>> >>> proceed.
>>>> >>>
>>>> >>> What do you think about it?
>>>> >>>
>>>>

Mime
View raw message