tajo-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Hyunsik Choi <hyun...@apache.org>
Subject Re: Timezone issues
Date Fri, 12 Dec 2014 07:08:52 GMT
Time zone issues seem to be completely fixed. I also added the documentation.
http://tajo.apache.org/docs/devel/time_zone.html

Best regards,
Hyunsik

On Mon, Dec 8, 2014 at 10:52 AM, Hyunsik Choi <hyunsik@apache.org> wrote:
> Hi Jaewoong and CharSyam,
>
> I submitted the patch that fixes timezone-related problems discussed
> here. I rearranged all time zone usages throughout Tajo.
> https://issues.apache.org/jira/browse/TAJO-1234
>
> Thanks to the discussion of this mailing list with you guys, I was
> clear enough to find or fix the problems.
>
> Warm regards,
> Hyunsik
>
> On Tue, Nov 25, 2014 at 5:54 PM, Hyunsik Choi <hyunsik@apache.org> wrote:
>> I'm happy to work with you and fix a major nuisance :) Later, I'll
>> share the timezone related problem with you when I found additional
>> bugs.
>>
>> Warm regards,
>> Hyunsik
>>
>> On Tue, Nov 25, 2014 at 4:17 PM, Jaewoong Jung <jungjw@gmail.com> wrote:
>>> Everything is very clear now thanks to your explanation. :)
>>>
>>> Okay, then I'll fix the issue by making DateDatum timezone-neutral and
>>> TimeDatum UTC-based. Also, I'll play with PostgreSQL to understand its
>>> timezone model better.
>>>
>>> Meanwhile, please feel free to assign timezone-related bugs to me as
>>> you see fit.
>>>
>>> On Mon, Nov 24, 2014 at 5:52 PM, Hyunsik Choi <hyunsik@apache.org> wrote:
>>>> Thank you all guys for your comments.
>>>>
>>>> Jaewoong,
>>>>
>>>> I leave inline comments. If my answers are not enough for your
>>>> question or I misunderstood your question, please feel free to ask
>>>> additional questions.
>>>>
>>>> Best regards,
>>>> Hyunsik
>>>>
>>>> On Fri, Nov 21, 2014 at 12:50 AM, Jaewoong Jung <jungjw@gmail.com>
wrote:
>>>>> There's another issue that hopefully Hyunsik would be able to clarify,
>>>>> and it's very crucial to handling timezones in these data types.
>>>>>
>>>>> Q: So, let's say (and I agree) TimeDatum represents an instant, so can
>>>>> be timezoned. Then, is it a UTC time or a local time?
>>>>>
>>>>
>>>> TimeDatum is UTC or should be UTC if some parts are not.
>>>>
>>>>> Let me explain why this question is important.
>>>>>
>>>>> DateDatum represents an instant. And, it is implicitly timezoned to
>>>>> the user local time. Which means, if I use '11/20/2014', it's
>>>>> '11/20/2014 0:00:00 PST' and is equivalent to a TimestampDatum for
>>>>> '11/20/2014 8:00:00 UTC'. (BTW, the compareTo implementation for
>>>>> DateDatum is broken in this regard. I'll file a separate issue for
>>>>> that.)
>>>>
>>>> I agree with your proposal. I also think that DateDatum does not need
>>>> to be timezoned. We should keep it as DateDatum instead of
>>>> TimestampDatum. Thank you for nice finding!
>>>>
>>>>> What about TimeDatum? It is currently timezoned to UTC, (surprise,
>>>>> anyone?) if I understood the code correctly. When we add a TimeDatum
>>>>> to a DateDatum, we convert the TimeDatum to the user local time, which
>>>>> implies TimeDatum is UTC-based. (Also, the comment next to the line
>>>>> explicitly mentions it.)
>>>>
>>>> You are right. TimeDatum represents UTC time.
>>>>
>>>> FYI, I'd like to describe additional background. There are only two
>>>> entry points to take time or timestamp values. One is records in input
>>>> tables, and another is SQL statements. Currently, Input table uses the
>>>> system global timezone specified in TajoConf (tajo-site.xml file).
>>>> Later, we will add one table property to allow users to specify
>>>> timezone for each table. For SQL statement (e.g., SELECT time
>>>> '03:00:00'), we will use client timezone. Also, we will provide some
>>>> expression to allow users to specify timezone for time or timestamp in
>>>> SQL statements.
>>>>
>>>> Consequently, only two entry points have to deal with timezone for
>>>> Timestamp and Time. Other parts in Tajo should deal with all values in
>>>> UTC.
>>>>
>>>>>
>>>>> So, here's the problem.
>>>>>
>>>>> Why do they have different timezones? They're incomplete as an instant
>>>>> when used alone and are complementary to each other. This is an
>>>>> important concept. To understand it, you have to think about why
>>>>> adding DateDatum and TimeDatum is allowed in the code. Originally,
>>>>> instants can't be added. (And, that's why I thought TimeDatum is not
>>>>> an instant.) You can't add (say) 11/20/2014 to 11/23/2019. Subtracting
>>>>> an instant from an instant makes sense and results in a period, but
>>>>> they can't be added.
>>>>>
>>>>> However, in the case of DateDatum and TimeDatum, additions are allowed
>>>>> because they're complementary to each other, and what the code does
>>>>> **conceptually** is concatenate the two.
>>>>
>>>> Great insight! So far, I haven't thought it.
>>>>
>>>>> Therefore, because they're intended to be used together, I'd argue
>>>>> they shouldn't have different timezones. Also, if they have different
>>>>> timezones, additions can't have a simple correct answer. What's the
>>>>> correct answer of 11/20/2014 (PST) + 8:00:00 (UTC)? There's no clear
>>>>> answer because they can't be simply concatenated.
>>>>>
>>>>> (FWIW, the current Tajo code thinks the answer is 11/20/2014 8:00:00
>>>>> in UTC. How many of you got it right?)
>>>>
>>>> You are definitely right :) If they have different timezone, the
>>>> problem becomes very complicated. Nobody wants it :) As I mentioned,
>>>> Timezone problem of Timestamp and Time data types should be addressed
>>>> in two entry points and client. We need to keep the processing
>>>> approach simple.
>>>>
>>>>> This can cause a lot of confusion to users. When they use a date
>>>>> alone, it is interpreted as a local time. But, as soon as they add a
>>>>> time to it, it is silently converted to UTC in a way which is very
>>>>> unexpected to many users.
>>>>>
>>>>> Why am I emphasizing it is unexpected? Look at the comparison below.
>>>>>
>>>>> What's the answer to 11/20/2014 (DateDatum) + 8 hours (IntervalDatum)?
>>>>> It's 11/20/2014 8:00 in a local time (PST on my machine). How about
>>>>> 11/20/2014 (DateDatum) + 8:00:00 (TimeDatum)? It's 11/20/2014 8:00 in
>>>>> UTC as I wrote above. How many users would be able to expect this?
>>>>
>>>> They are definitely a bug. We follow PostgreSQL in all aspects, and
>>>> the following results come from the PostgreSQL. The results of two
>>>> operations are the same.
>>>>
>>>> hyunsik=> SELECT date '11/20/2014' + time '08:00';
>>>>       ?column?
>>>> ---------------------
>>>>  2014-11-20 08:00:00
>>>> (1 row)
>>>>
>>>> hyunsik=> select date '11/20/2014' + interval '8 hrs';
>>>>       ?column?
>>>> ---------------------
>>>>  2014-11-20 08:00:00
>>>> (1 row)
>>>>
>>>>
>>>>>
>>>>> So, coming back to my original question. What timezone should a
>>>>> TimeDatum have? UTC or local time? It's currently UTC. But, I believe
>>>>> it should be changed to the local time zone.
>>>>
>>>> TimeDatum is UTC.
>>>>
>>>> In sum, we should keep both TimeDatum and TimestampDatum UTC values.
>>>> Then, we should address timezone offsets in two entry points and
>>>> client side.
>>>>
>>>>>
>>>>> (Sorry for the long email. But, I think it's critical to get this
>>>>> right and build consensus over it so that we can provide a consistent
>>>>> behavior going forward. The actual fix will be really simple, though.)
>>>>>
>>>>> On Thu, Nov 20, 2014 at 2:44 AM, Hyunsik Choi <hyunsik.choi@gmail.com>
wrote:
>>>>>> Those parts have poor documentation.
>>>>>>
>>>>>> I agree with your investigation. I also could find many misuse of
timezone
>>>>>> in many parts. We should make them clear and fix them in this chance.
>>>>>>
>>>>>> I just got off the plane, and I'm still on the road. So, I'll give
more
>>>>>> comments tomorrow.
>>>>>>
>>>>>> - hyunsik
>>>>>>
>>>>>> On Thu, Nov 20, 2014 at 5:24 PM, Jaewoong Jung <jungjw@gmail.com>
wrote:
>>>>>>
>>>>>>> Wow, this seemingly trivial issue has surprisingly many problems
involved.
>>>>>>>
>>>>>>> The most critical one, though, is TimeMeta class. Presumably
because
>>>>>>> it is poorly documented, it is being used for two different purposes
>>>>>>> in Tajo code base. Some code treats it as a date time representation,
>>>>>>> which I believe is the original intention, but some treat it
as the
>>>>>>> human-readable equivalent of TimeDatum by completely ignoring
>>>>>>> date-related fields.
>>>>>>>
>>>>>>> For example, DateTimeUtil.date2j(long julianDate, TimeMeta tm),
which
>>>>>>> converts a julian timestamp to a TimeMeta value, doesn't touch
>>>>>>> dayOfMonth, monthOfYear, or years values and just puts all values
for
>>>>>>> hour and above units in hours field.
>>>>>>>
>>>>>>> There are other minor problems like incorrect comments and absent
>>>>>>> default values, but the most critical one is misuse of TimeMeta.
>>>>>>>
>>>>>>> I'll try to break up my patches so that each has a clear and
>>>>>>> easy-to-understand goal. Sending this heads-up to let you know
>>>>>>> there'll be more issues filed and patches sent than you might
expect.
>>>>>>>
>>>>>>> On Wed, Nov 19, 2014 at 10:56 PM, Jaewoong Jung <jungjw@gmail.com>
wrote:
>>>>>>> > Yeah, after some more research, I found that TImeDatum is
a somewhat
>>>>>>> > ambiguous data type.
>>>>>>> >
>>>>>>> > Its original purpose is to represent a time of a day, i.e.
hh:mm:ss
>>>>>>> > part of an instant. So, it could be viewed as instant data
though some
>>>>>>> > may argue it's incomplete to fully represent an instant.
Anyway, given
>>>>>>> > that TimestampDatum has limitation in terms of the time
range it can
>>>>>>> > represent, and given that Tajo doesn't have DateTime data
type,
>>>>>>> > clients should be allowed to use it with a timezone.
>>>>>>> >
>>>>>>> > I'll change the direction and try to address the issue by
fixing the
>>>>>>> > underflow error.
>>>>>>> >
>>>>>>> > Thanks for the input. :)
>>>>>>> >
>>>>>>> > On Wed, Nov 19, 2014 at 10:24 PM, Jong-young Park <eminency@gmail.com>
>>>>>>> wrote:
>>>>>>> >> Hi, Jaewoong.
>>>>>>> >>
>>>>>>> >> To express time period value, IntervalDatum is existing
as I know.
>>>>>>> >>
>>>>>>> >> So I think it is right that TimeDatum is for some time
point.
>>>>>>> >>
>>>>>>> >> And TimestampDatum seems it is doing both roles of DateDatum
and
>>>>>>> TimeDatum.
>>>>>>> >>
>>>>>>> >> Regards,
>>>>>>> >> Jongyoung.
>>>>>>> >>
>>>>>>> >> On Wed Nov 19 2014 at 오후 5:23:40 Jaewoong Jung <jungjw@gmail.com>
>>>>>>> wrote:
>>>>>>> >>
>>>>>>> >>> It turns out, TAJO-1191 is slightly more complicated
than I thought.
>>>>>>> >>> (https://issues.apache.org/jira/browse/TAJO-1191)
>>>>>>> >>>
>>>>>>> >>> Basically, it's about whether TimeDatum may have
a timezone tied with
>>>>>>> >>> it. I **believe** TimeDatum is originally designed
to hold a time
>>>>>>> >>> period value, not an instant. (TimestampDatum seems
to be the
>>>>>>> >>> canonical container for instants.) So, it doesn't
make sense to apply
>>>>>>> >>> any timezones to TimeDatum values, but it's being
done in a few
>>>>>>> >>> places. And, that's why the test is failing on my
machines.
>>>>>>> >>>
>>>>>>> >>> I'm going to try to fix it by removing all timezone-related
references
>>>>>>> >>> around the class, but I want to check my assumption
with you before I
>>>>>>> >>> proceed.
>>>>>>> >>>
>>>>>>> >>> What do you think about it?
>>>>>>> >>>
>>>>>>>

Mime
View raw message