tajo-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From CharSyam <chars...@gmail.com>
Subject Re: Timezone issues
Date Sun, 23 Nov 2014 14:42:03 GMT
@hyunsik, It's good new. :)

2014-11-23 23:30 GMT+09:00 Hyunsik Choi <hyunsik@apache.org>:

> (Actually, I had suffered from some mailing list rejection problem for
> few days. This problem was caused by the change of committer shell
> access policy. Finally, I fixed the problem. I'm not sure if you
> received the following mail. So, I send it again.)
>
> Hi Jaewoong,
>
> Today, I investigated the problem in more deep. I found the problem
> about misuse of timezone. Recently, I've also suffered from this
> problem while I'm working tajo-client. Actually, the recap of datetime
> types were contributed by one contributor, and I reviewed the codes.
> Since I overlooked timezone-related parts, it's my fault consequently.
> Anyway, we can fix it. I'm very happy for us to fix this problem with
> your help.
>
> The main cause of this problem is that Datum-level data structures
> deal with timezone directly.
>
> First of all, in this email, I'm going to give the background in more
> detail; I think that this background includes many answers of your
> questions. Then, I'll give the answers of your questions in next
> email.
>
> As you may know, In SQL, there are two kinds of data types: one is
> without tiemzone and another is without timezone.
>
> SQL standard types without timezone:
> * Timestamp
> * Time
>
> SQL standard types with timezone:
> * Timestampz
> * Timez
>
> Each timestampz or Timez value has timezone offset itself. Both types
> are straightforward. Also, Date type does not have 'with timezone'
> type.
>
> BTW, 'types without timezone' (e.g., Timestamp and Time) can be
> affected by an external timezone in almost all SQL systems that I've
> seen. In other words, their timezone can be present in outside of data
> values. By default, many systems use the timezone of local machine.
> Similarly, Tajo uses a static variable 'TajoConf.CURRENT_TIMEZONE'.
>
> The main issue is when data values are adjusted by timezone. The
> adjustment should be performed before Datum is created. In other
> words,  in scanner, Timestamp or Time value should be adjusted by
> timezone. TimestampDatum or TimeDatum must be with UTC+0. But, in the
> current implementation, some parts of TimestampDatum and TimeDatum
> address directly timezone. It's wrong in my opinion. Also, all unit
> tests for TimestampDatum and TimeDatum should be performed with UTC.
>
> In addition, timestamp or time values should be adjusted one more by
> 'user timezone' when an user gets the query results. In sum, the flow
> is as follows:
>
> Timestamp or Time value in an Input table ---(adjust by an external
> timezone or local timezone)---> TimestampDatum or TimeDatum with UTC
> ---> query processing  ---(adjust by an user timezone)-----> Print
>
> The reason why TimestampDatum and TimeDaum values must be addressed in
> UTC is the simplicity of processing codes. In many cases, two or more
> tables used in one query can have multiple different timezones. If all
> values are in UTC, the processing does not need to consider different
> timezones. But, the current implementation has one global timezone. We
> should fix it, and I created the jira issue for it at
> https://issues.apache.org/jira/browse/TAJO-1186.
>
> Best regards,
> Hyunsik
>
> On Fri, Nov 21, 2014 at 12:50 AM, Jaewoong Jung <jungjw@gmail.com> wrote:
> > There's another issue that hopefully Hyunsik would be able to clarify,
> > and it's very crucial to handling timezones in these data types.
> >
> > Q: So, let's say (and I agree) TimeDatum represents an instant, so can
> > be timezoned. Then, is it a UTC time or a local time?
> >
> > Let me explain why this question is important.
> >
> > DateDatum represents an instant. And, it is implicitly timezoned to
> > the user local time. Which means, if I use '11/20/2014', it's
> > '11/20/2014 0:00:00 PST' and is equivalent to a TimestampDatum for
> > '11/20/2014 8:00:00 UTC'. (BTW, the compareTo implementation for
> > DateDatum is broken in this regard. I'll file a separate issue for
> > that.)
> >
> > What about TimeDatum? It is currently timezoned to UTC, (surprise,
> > anyone?) if I understood the code correctly. When we add a TimeDatum
> > to a DateDatum, we convert the TimeDatum to the user local time, which
> > implies TimeDatum is UTC-based. (Also, the comment next to the line
> > explicitly mentions it.)
> >
> > So, here's the problem.
> >
> > Why do they have different timezones? They're incomplete as an instant
> > when used alone and are complementary to each other. This is an
> > important concept. To understand it, you have to think about why
> > adding DateDatum and TimeDatum is allowed in the code. Originally,
> > instants can't be added. (And, that's why I thought TimeDatum is not
> > an instant.) You can't add (say) 11/20/2014 to 11/23/2019. Subtracting
> > an instant from an instant makes sense and results in a period, but
> > they can't be added.
> >
> > However, in the case of DateDatum and TimeDatum, additions are allowed
> > because they're complementary to each other, and what the code does
> > **conceptually** is concatenate the two.
> >
> > Therefore, because they're intended to be used together, I'd argue
> > they shouldn't have different timezones. Also, if they have different
> > timezones, additions can't have a simple correct answer. What's the
> > correct answer of 11/20/2014 (PST) + 8:00:00 (UTC)? There's no clear
> > answer because they can't be simply concatenated.
> >
> > (FWIW, the current Tajo code thinks the answer is 11/20/2014 8:00:00
> > in UTC. How many of you got it right?)
> >
> > This can cause a lot of confusion to users. When they use a date
> > alone, it is interpreted as a local time. But, as soon as they add a
> > time to it, it is silently converted to UTC in a way which is very
> > unexpected to many users.
> >
> > Why am I emphasizing it is unexpected? Look at the comparison below.
> >
> > What's the answer to 11/20/2014 (DateDatum) + 8 hours (IntervalDatum)?
> > It's 11/20/2014 8:00 in a local time (PST on my machine). How about
> > 11/20/2014 (DateDatum) + 8:00:00 (TimeDatum)? It's 11/20/2014 8:00 in
> > UTC as I wrote above. How many users would be able to expect this?
> >
> > So, coming back to my original question. What timezone should a
> > TimeDatum have? UTC or local time? It's currently UTC. But, I believe
> > it should be changed to the local time zone.
> >
> > (Sorry for the long email. But, I think it's critical to get this
> > right and build consensus over it so that we can provide a consistent
> > behavior going forward. The actual fix will be really simple, though.)
> >
> > On Thu, Nov 20, 2014 at 2:44 AM, Hyunsik Choi <hyunsik.choi@gmail.com>
> wrote:
> >> Those parts have poor documentation.
> >>
> >> I agree with your investigation. I also could find many misuse of
> timezone
> >> in many parts. We should make them clear and fix them in this chance.
> >>
> >> I just got off the plane, and I'm still on the road. So, I'll give more
> >> comments tomorrow.
> >>
> >> - hyunsik
> >>
> >> On Thu, Nov 20, 2014 at 5:24 PM, Jaewoong Jung <jungjw@gmail.com>
> wrote:
> >>
> >>> Wow, this seemingly trivial issue has surprisingly many problems
> involved.
> >>>
> >>> The most critical one, though, is TimeMeta class. Presumably because
> >>> it is poorly documented, it is being used for two different purposes
> >>> in Tajo code base. Some code treats it as a date time representation,
> >>> which I believe is the original intention, but some treat it as the
> >>> human-readable equivalent of TimeDatum by completely ignoring
> >>> date-related fields.
> >>>
> >>> For example, DateTimeUtil.date2j(long julianDate, TimeMeta tm), which
> >>> converts a julian timestamp to a TimeMeta value, doesn't touch
> >>> dayOfMonth, monthOfYear, or years values and just puts all values for
> >>> hour and above units in hours field.
> >>>
> >>> There are other minor problems like incorrect comments and absent
> >>> default values, but the most critical one is misuse of TimeMeta.
> >>>
> >>> I'll try to break up my patches so that each has a clear and
> >>> easy-to-understand goal. Sending this heads-up to let you know
> >>> there'll be more issues filed and patches sent than you might expect.
> >>>
> >>> On Wed, Nov 19, 2014 at 10:56 PM, Jaewoong Jung <jungjw@gmail.com>
> wrote:
> >>> > Yeah, after some more research, I found that TImeDatum is a somewhat
> >>> > ambiguous data type.
> >>> >
> >>> > Its original purpose is to represent a time of a day, i.e. hh:mm:ss
> >>> > part of an instant. So, it could be viewed as instant data though
> some
> >>> > may argue it's incomplete to fully represent an instant. Anyway,
> given
> >>> > that TimestampDatum has limitation in terms of the time range it can
> >>> > represent, and given that Tajo doesn't have DateTime data type,
> >>> > clients should be allowed to use it with a timezone.
> >>> >
> >>> > I'll change the direction and try to address the issue by fixing the
> >>> > underflow error.
> >>> >
> >>> > Thanks for the input. :)
> >>> >
> >>> > On Wed, Nov 19, 2014 at 10:24 PM, Jong-young Park <
> eminency@gmail.com>
> >>> wrote:
> >>> >> Hi, Jaewoong.
> >>> >>
> >>> >> To express time period value, IntervalDatum is existing as I know.
> >>> >>
> >>> >> So I think it is right that TimeDatum is for some time point.
> >>> >>
> >>> >> And TimestampDatum seems it is doing both roles of DateDatum and
> >>> TimeDatum.
> >>> >>
> >>> >> Regards,
> >>> >> Jongyoung.
> >>> >>
> >>> >> On Wed Nov 19 2014 at 오후 5:23:40 Jaewoong Jung <jungjw@gmail.com>
> >>> wrote:
> >>> >>
> >>> >>> It turns out, TAJO-1191 is slightly more complicated than I
> thought.
> >>> >>> (https://issues.apache.org/jira/browse/TAJO-1191)
> >>> >>>
> >>> >>> Basically, it's about whether TimeDatum may have a timezone
tied
> with
> >>> >>> it. I **believe** TimeDatum is originally designed to hold
a time
> >>> >>> period value, not an instant. (TimestampDatum seems to be the
> >>> >>> canonical container for instants.) So, it doesn't make sense
to
> apply
> >>> >>> any timezones to TimeDatum values, but it's being done in a
few
> >>> >>> places. And, that's why the test is failing on my machines.
> >>> >>>
> >>> >>> I'm going to try to fix it by removing all timezone-related
> references
> >>> >>> around the class, but I want to check my assumption with you
> before I
> >>> >>> proceed.
> >>> >>>
> >>> >>> What do you think about it?
> >>> >>>
> >>>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message