From: Greg Rahn
Date: Thu, 11 Jan 2018 21:15:22 -0800
Subject: Re: Impala, Kudu, and timestamps (and UNIXTIME_MICROS...)
To: user@impala.apache.org

Franco-

Are there still outstanding questions or issues with this? I'm just returning from holiday, so I wanted to check in.

Agreed that timestamps with and without time zone are different beasts and come with their own challenges.

Cheers,

On Tue, Dec 19, 2017 at 6:23 AM, Franco Venturi wrote:
> Thanks for your input, Balazs.
> I am still pondering this issue, and I found it very useful to follow
> your advice to see what's already in JIRA and to research what others
> are doing.
>
> I didn't find anything that exactly describes my problem in the Impala
> JIRA (I might have missed it); however, I did find a closely related
> discussion that is occurring in the Hive JIRA: I am referring to JIRAs
> HIVE-14305 (https://issues.apache.org/jira/browse/HIVE-14305), HIVE-14412
> (https://issues.apache.org/jira/browse/HIVE-14412), and HIVE-16418
> (https://issues.apache.org/jira/browse/HIVE-16418).
>
> I also found it useful to read how two of the most widely used RDBMSs deal
> with timestamps:
>     - Oracle (see: https://docs.oracle.com/en/database/oracle/oracle-database/12.2/nlspg/datetime-data-types-and-time-zone-support.html)
>     - PostgreSQL (see: https://www.postgresql.org/docs/10/static/datatype-datetime.html)
>
> Finally, I am going to refer to the Java 8 API as described here:
> https://docs.oracle.com/javase/8/docs/api/overview-summary.html; the
> reason is that many Impala users (me included) connect to it via
> Java/JDBC, and in my opinion it is very important that the Impala data
> types are mapped to the correct Java classes to avoid confusion.
>
> After this long preamble, here are some notes that will hopefully help
> this discussion.
>
> - At the center of this issue is the fact that there are two significantly
> different data types that we commonly refer to when we use the word
> 'timestamp':
>     - timestamp without time zone (most commonly referred to as just
>       'timestamp')
>     - timestamp with time zone
>
> - These two types are like apples and oranges in my opinion; getting
> confused between the two (like I did) can cause a lot of frustration (my
> first post in this thread shows that pretty well).
>
> - The first data type ('timestamp without time zone', or just 'timestamp')
> is best described by this sentence in section 4.2.1.6 (Choosing a TIMESTAMP
> Data Type) of the Oracle document:
>
>     "Use the TIMESTAMP data type when you need a datetime value to
>     record the time of an event without the time zone. For example, you can
>     store information about the times when workers punch a time card in and
>     out of their assembly line workstations. Because this is always a local
>     time it is then not needed to store the timezone part"
>
> - I think this is the kind of timestamp that is currently implemented in
> Impala (please correct me if I am wrong), and in my opinion it should be
> mapped to something like the Java type java.time.LocalDateTime; the Java 8
> API docs for java.time.LocalDateTime contain this sentence, which I think
> provides good insight into the meaning of this data type:
>
>     "This class does not store or represent a time-zone. Instead, it
>     is a description of the date, as used for birthdays, combined with the
>     local time as seen on a wall clock. It cannot represent an instant on
>     the time-line without additional information such as an offset or
>     time-zone."
>
> - Also, this timestamp (again, in my opinion) should not be mapped to the
> Kudu type UNIXTIME_MICROS, because their meanings are different (this is
> what triggered my initial confusion, and I suspect I am not the only one
> out there who misunderstood this difference).
>
> - For the second data type ('timestamp with time zone') I couldn't find a
> good definition in the Oracle document; however, this sentence from the
> Java API docs for java.util.Date (of which java.sql.Timestamp is a
> subclass) offers good insight:
>
>     "The class Date represents a specific instant in time"
>
> - This is also indirectly implied by the following sentence in the Oracle
> document:
>
>     "Two TIMESTAMP WITH TIME ZONE values are considered identical if
>     they represent the same instant in UTC, regardless of the TIME ZONE
>     offsets stored in the data."
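The two definitions quoted above map cleanly onto the standard java.time API; a minimal sketch of the distinction (the epoch value 1509859800000 is the one used later in this thread):

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;

public class TimestampKinds {
    public static void main(String[] args) {
        // 'timestamp without time zone': a wall-clock reading, not a point in time.
        LocalDateTime wallClock = LocalDateTime.of(2017, 11, 5, 1, 30, 0);

        // The same wall-clock reading denotes different instants in different zones.
        Instant inNewYork = wallClock.atZone(ZoneId.of("America/New_York")).toInstant();
        Instant inLondon = wallClock.atZone(ZoneId.of("Europe/London")).toInstant();
        System.out.println(inNewYork.equals(inLondon)); // false

        // 'timestamp with time zone' semantics: a single absolute instant,
        // which is also what Kudu's UNIXTIME_MICROS encodes (as microseconds
        // since the Unix epoch).
        Instant instant = Instant.ofEpochMilli(1509859800000L);
        System.out.println(instant); // 2017-11-05T05:30:00Z
    }
}
```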
>
> - The 'timestamp with time zone' is what I think should be mapped to the
> Java type java.sql.Timestamp, and it also corresponds to the actual
> meaning of the Kudu type UNIXTIME_MICROS.
>
> - This is the kind of timestamp type that is most useful to businesses
> (finance, healthcare, security logs) because it captures the "absolute"
> moment in time at which a transaction or event occurred, regardless of the
> timezone in which it occurred or whether daylight saving time was in effect.
>
> In conclusion, the point I am at right now is that, to really fix this
> issue in Impala, two things should happen:
>
> - a new data type ('timestamp with time zone'), with the features
> mentioned above, should be implemented - this change would hopefully not
> break any existing applications
>
> - the current 'timestamp' type should be mapped to the Java class
> java.time.LocalDateTime (not java.sql.Timestamp) - this would definitely
> break existing applications, and therefore should be postponed to Impala 3.0
>
> Regards,
> Franco
>
> ------------------------------
> *From: *"Jeszy"
> *To: *user@impala.apache.org
> *Sent: *Monday, December 18, 2017 7:49:52 AM
> *Subject: *Re: Impala, Kudu, and timestamps (and UNIXTIME_MICROS...)
>
> Hello Franco,
>
> Thanks for your feedback! I agree there are pain points with using
> timestamps, especially together with other systems.
> Is there any particular approach or solution you propose that would
> work well for you? Have you found any jiras on issues.apache.org that
> describe what you're asking for? Commenting on a jira will help the
> team track your input better.
> Regards,
> Balazs
>
> On 17 December 2017 at 00:38, Franco Venturi wrote:
> > Please note that the discussion below refers to the following versions:
> >     - Impala: v2.10.0-cdh5.13.0
> >     - Kudu: 1.5.0-cdh5.13.0
> >     - Everything runs on a standard Cloudera 5.13 installation
> >
> > A few days ago I was writing some Java code to migrate several tables
> > directly from Oracle to Kudu (to be queried later on by our developers
> > and BI tools using Impala). Most of these tables have columns of type
> > "timestamp" (to be exact, they come in as instances of class
> > oracle.sql.TIMESTAMP and I cast them to java.sql.Timestamp; for the rest
> > of this discussion I'll assume we only deal with objects of
> > java.sql.Timestamp, to keep things simple).
> > As you probably know, Kudu, starting I think with version 1.3.1, has a
> > type called 'UNIXTIME_MICROS', and that type gets mapped by Impala to
> > the "Impala TIMESTAMP" data type
> > (https://www.cloudera.com/documentation/enterprise/latest/topics/impala_timestamp.html).
> >
> > A good description of the meaning of 'UNIXTIME_MICROS' in Kudu is in the
> > 'Apache Kudu Schema Design' document
> > (https://kudu.apache.org/docs/schema_design.html), which says:
> >
> >     unixtime_micros (64-bit microseconds since the Unix epoch)
> >
> > where the 'Unix epoch' is defined as 1970-01-01 00:00:00 GMT.
> >
> > With this understanding I went ahead and wrote my Java code; when I ran
> > the first few tests, I noticed that the timestamp values returned by
> > Impala (I created an 'external' table 'stored as kudu' in Impala) were
> > off by several hours compared to the values returned by the original
> > table in Oracle (our servers, both the Oracle ones and the Impala/Kudu
> > ones, are all configured in the 'America/New_York' timezone).
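Since the mapping at issue is between java.sql.Timestamp values and Kudu's 64-bit microseconds-since-epoch representation, the conversion arithmetic can be sketched as follows (just the arithmetic, not the Kudu client API):

```java
import java.sql.Timestamp;

public class ToUnixtimeMicros {
    // Convert a java.sql.Timestamp to microseconds since the Unix epoch,
    // i.e. the representation UNIXTIME_MICROS uses. getTime() already
    // includes the millisecond portion of the nanos field, so only the
    // sub-millisecond remainder needs to be added.
    static long toMicros(Timestamp ts) {
        return ts.getTime() * 1000L + (ts.getNanos() % 1_000_000) / 1000;
    }

    public static void main(String[] args) {
        Timestamp ts = new Timestamp(1509859800000L); // value used later in this thread
        System.out.println(toMicros(ts)); // 1509859800000000
    }
}
```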
> >
> > To investigate this difference, I created a simple table in Kudu with
> > just two columns: an INT64 as the primary key and a UNIXTIME_MICROS as a
> > timestamp. I ran a few inserts and selects over this table in Impala and
> > figured out that Impala stores a value that is more or less defined as
> > follows:
> >
> >     number of microseconds since the Unix epoch (i.e. what I was
> >     expecting originally)
> >     + offset with respect to GMT of the timestamp I inserted (in my case
> >     this is the offset for EST or EDT, depending on whether that
> >     timestamp fell during EST (winter) or EDT (summer))
> >
> > This is how Impala achieves what is described as:
> >
> >     Impala does not store timestamps using the local timezone, to avoid
> >     undesired results from unexpected time zone issues
> >
> > That same page has caveats like the following, which sent a shiver down
> > my spine:
> >
> >     If that value was written to a data file, and shipped off to a
> >     distant server to be analyzed alongside other data from far-flung
> >     locations, the dates and times would not match up precisely because
> >     of time zone differences
> >
> > This means that if anyone is using (or even thinking about using)
> > "Impala timestamps" to store, say, financial or health services (or
> > security) events, they'll find some nasty "surprises" (even if they
> > never plan to move their servers and only do business in one timezone).
> >
> > Consider, for instance, anything that occurred between 1am and 2am EDT
> > on 11/5/2017 (i.e. in the hour before we moved our clocks back from EDT
> > to EST) - there's no way to store the timestamps for these events in
> > Kudu via Impala.
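Franco's fall-back example can be checked directly with java.time: during the 2017-11-05 transition, the wall-clock reading 01:30 in America/New_York corresponds to two distinct instants, one hour apart. A small sketch:

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.util.List;

public class DstOverlap {
    public static void main(String[] args) {
        ZoneId ny = ZoneId.of("America/New_York");
        LocalDateTime wallClock = LocalDateTime.of(2017, 11, 5, 1, 30);

        // During the EDT -> EST fall-back, this local time is ambiguous:
        // both -04:00 and -05:00 are valid offsets for it.
        List<ZoneOffset> offsets = ny.getRules().getValidOffsets(wallClock);
        System.out.println(offsets.size()); // 2

        // The two distinct instants that share the same wall-clock reading:
        long edtMillis = wallClock.toInstant(ZoneOffset.ofHours(-4)).toEpochMilli();
        long estMillis = wallClock.toInstant(ZoneOffset.ofHours(-5)).toEpochMilli();
        System.out.println(edtMillis); // 1509859800000
        System.out.println(estMillis); // 1509863400000
        System.out.println(estMillis - edtMillis); // 3600000, exactly one hour
    }
}
```

These are exactly the two getTime() values that appear in the JDBC insert/select test in this thread.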
> >
> > To prove this, I wrote this simple piece of Java code (which uses Java 8
> > and only well-documented, non-deprecated classes and methods) to do just
> > an insert and a select, via Impala JDBC, of a timestamp row in the
> > simple table that I mentioned above (primary key + timestamp column):
> >
> >     // run insert
> >     long primaryKey = 1L;
> >     PreparedStatement insert = connection.prepareStatement(
> >             "insert into " + table + " values (?, ?)");
> >     insert.setLong(1, primaryKey);
> >     Timestamp timestampIn = new Timestamp(1509859800000L);
> >     System.out.println("TimestampIn: " + timestampIn
> >             + " - getTime(): " + timestampIn.getTime());
> >     insert.setTimestamp(2, timestampIn);
> >     insert.executeUpdate();
> >     insert.close();
> >
> >     // run select
> >     PreparedStatement select = connection.prepareStatement(
> >             "select " + timestampColumn + " from " + table
> >             + " where " + primaryKeyColumn + "=?");
> >     select.setLong(1, primaryKey);
> >     ResultSet resultSet = select.executeQuery();
> >     while (resultSet.next()) {
> >         Timestamp timestampOut = resultSet.getTimestamp(1);
> >         System.out.println("TimestampOut: " + timestampOut
> >                 + " - getTime(): " + timestampOut.getTime());
> >     }
> >     resultSet.close();
> >     select.close();
> >
> > and this is the output:
> >
> >     TimestampIn: 2017-11-05 01:30:00.0 - getTime(): 1509859800000
> >     TimestampOut: 2017-11-05 01:30:00.0 - getTime(): 1509863400000
> >
> > If this kind of timestamp had been designed, say, in 1972, I would
> > probably have been OK with it, but in 2017, when people and companies do
> > business 24/7, I find this approach (and sentences like "dates and times
> > would not match up precisely") unacceptable.
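The one-hour discrepancy in the output above can be reproduced without a database: the two getTime() values render as the identical wall-clock string in America/New_York, because one instant is 01:30 EDT and the other is 01:30 EST. A sketch (pinning the default zone so Timestamp.toString() is deterministic):

```java
import java.sql.Timestamp;
import java.util.TimeZone;

public class RoundTripDelta {
    public static void main(String[] args) {
        // The original poster's servers all run in America/New_York.
        TimeZone.setDefault(TimeZone.getTimeZone("America/New_York"));

        Timestamp in = new Timestamp(1509859800000L);  // 01:30 EDT
        Timestamp out = new Timestamp(1509863400000L); // 01:30 EST

        // Both instants print as the same local wall-clock reading...
        System.out.println(in);  // 2017-11-05 01:30:00.0
        System.out.println(out); // 2017-11-05 01:30:00.0

        // ...yet they are exactly one hour apart on the time-line.
        System.out.println(out.getTime() - in.getTime()); // 3600000
    }
}
```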
> >
> > In its defense, the Impala TIMESTAMP web page mentioned above spends
> > several paragraphs on how to deal with "UTC timestamps" and problems
> > like the ones shown above, but it basically has the user wrap everything
> > in Impala functions like 'to_utc_timestamp()' and 'from_utc_timestamp()'.
> >
> > This proposed solution - besides the performance impact of, say, having
> > a query like this:
> >
> >     select ... where to_utc_timestamp(timestamp_column_a) <
> >         to_utc_timestamp(timestamp_column_b)
> >
> > also makes it much harder to write queries involving timestamp fields
> > (which is very common in my workplace) or to use them with automated BI
> > tools.
> >
> > It is definitely not what one would expect from an analytic MPP database
> > that aspires to be used at the Enterprise level in 2017.
> >
> > Regards,
> > Franco Venturi