Date: Tue, 5 Jul 2016 08:08:11 +0000 (UTC)
From: "Khurram Faraaz (JIRA)"
To: issues@drill.apache.org
Reply-To: dev@drill.apache.org
Subject: [jira] [Updated] (DRILL-4765) Missing, incorrect information in Drill data types page

[ https://issues.apache.org/jira/browse/DRILL-4765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Khurram Faraaz updated DRILL-4765:
----------------------------------
    Priority: Major  (was: Minor)

> Missing, incorrect information in Drill data types page
> -------------------------------------------------------
>
>                 Key: DRILL-4765
>                 URL: https://issues.apache.org/jira/browse/DRILL-4765
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Documentation
>    Affects Versions: 1.6.0
>            Reporter: Paul Rogers
>
> Consider the Drill Supported Types page: https://drill.apache.org/docs/supported-data-types/
> A number of issues can be seen.
> For BIGINT, it would be clearer to express the range as: -2^63 to 2^63-1.
> For INTEGER, it would be clearer to express the range as: -2^31 to 2^31-1.
> DATE: The statement "in YYYY-MM-DD format" is wrong. The internal representation has no format; it is just a number representing the day count. The format is applied only on output and varies depending on the tool used. Perhaps for the Drill web UI it is in ISO format.
> DATE: Presumably the date is not time-zone specific. That is, 2016-07-01 is the first of July in both the US and India, though a given absolute time may be on two different dates in these locations.
> DATE: We use 4713 BC as a 0-point. But the calendar system has changed many times since that date. (Indeed, the current system did not even exist on that date.) Is this a simple projection of the current system back in time, or does it adjust for the discontinuities in the Gregorian calendar? This should be stated, as it is important for any data files that contain historical dates. (And is why choosing a 20th-century 0-point would have been better...)
> FLOAT, DOUBLE: Presumably these are in the IEEE 754 format? If so, let's state that.
> INTERVAL: There are many ways that intervals have been represented in DB systems. Parquet represents data as a triple: months, days and (milli)seconds. Does Drill use a similar format? If not, what is the format? A normal DB can declare the interval as part of the data declaration. How does Drill infer the format? How does the user access the parts of the range?
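
On the DATE items above, the difference between a stored day count and its display format can be sketched as follows. This is only an illustration, not Drill's actual encoding: it assumes the count is a Julian Day Number (zero point in 4713 BC) computed by projecting the current Gregorian calendar backwards, which is exactly the point the issue asks the docs to confirm. The helper names `to_day_count` and `from_day_count` are hypothetical.

```python
from datetime import date

# Offset between Python's proleptic Gregorian ordinal (0001-01-01 == 1)
# and the Julian Day Number, whose day 0 falls in 4713 BC.
# Assumption: a plain backwards projection of the Gregorian calendar.
JDN_OFFSET = 1721425


def to_day_count(d: date) -> int:
    """Return the Julian Day Number for a proleptic Gregorian date."""
    return d.toordinal() + JDN_OFFSET


def from_day_count(n: int) -> date:
    """Inverse of to_day_count."""
    return date.fromordinal(n - JDN_OFFSET)


# What a storage layer might keep: a bare integer with no format at all.
stored = to_day_count(date(2016, 7, 1))

# The format exists only when a tool renders the value.
iso_text = from_day_count(stored).isoformat()          # ISO style
us_text = from_day_count(stored).strftime("%m/%d/%Y")  # US style

print(stored, iso_text, us_text)
```

The same integer yields both strings, which is why "in YYYY-MM-DD format" describes a tool's output rather than the type itself.
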
> INTERVAL: The footnote says, "Internally, INTERVAL is represented as INTERVALDAY or INTERVALYEAR." But if so, then INTERVAL can't represent a time interval: a serious limitation. Also, we can't convert a Parquet Interval to a Drill interval, since there is no mapping to Drill that includes months, days and seconds. This is a huge limitation and should be explained.
> SMALLINT: This is a supported types table, but the footnotes say SMALLINT is not supported. We also do not list the many internal Value Vector types we don't support (int8, uint8, int16, uint16, uint32 and so on). Should we list SMALLINT if we don't actually support it?
> TIME: The format is actually number of seconds since 2001-01-01. The "24-hour based time ... in hours, minutes, seconds format" confuses display format with internal representation. See DATE above.
> TIME: Presumably the time is in local time, not UTC. That is, the time is 12:34:56 with the time zone left unspecified.
> TIME: The example for TIME is "22:55:55.23". Note that the example shows milliseconds, but the description says the time unit is seconds. Which is right?
> TIME: The example shows just a time (seconds since midnight), but the description says that this is a timestamp: number of seconds since 2010-01-01. If so, then is TIME like TIMESTAMP (with a different basis)? Or is it really a time-only value (so that the description is wrong)?
> TIMESTAMP: The description says "JDBC timestamp", but this is not accurate. JDBC is a layer on top of a DB. So, we could say, "JDBC timestamp format".
> TIMESTAMP: Explain the basis. A JDBC timestamp (https://docs.oracle.com/javase/8/docs/api/java/sql/Timestamp.html) expresses time in nanoseconds since 1970-01-01. So, does Drill also have nanosecond precision? The docs say, "optional milliseconds", so presumably Drill only keeps milliseconds. As a result, the Drill timestamp is NOT a JDBC timestamp.
> TIMESTAMP: JDBC timestamps are vague.
> They are based on a Java Date, which is defined as milliseconds since 1970-01-01T00:00:00 UTC. But it seems a JDBC timestamp is local (it has no implied timezone). Does Drill assume that a TIMESTAMP is UTC (like java.util.Date) or local (like java.sql.Date)?
> TIMESTAMP & DATE/TIME: We've created an incompatibility between the date & time format on the one hand, and TIMESTAMP on the other. We should explain how to convert between the two, since it is non-obvious how this would be done without noodling out the conversion factor. Or, is a Drill timestamp also based on 2001-01-01 like a Drill date?
> TIMESTAMP: "format: yyyy-MM-dd HH:mm:ss.SSS". Again, the timestamp does not have a format; it is just a count of millis (or nanos, see above). As explained above, formatting is done by each tool and can be whatever the user wants.
> CHAR: The description says, "The default limit is 1 character. The maximum character limit is 2,147,483,647." which seems to apply to the CHAR type: CHAR(1) to CHAR(2,147,483,647). But the footnote says, "Currently, Drill supports only variable-length strings." So, the "default limit..." stuff does not actually apply, does it?
> General: In a full DB, types are important because the user declares columns of the given type. Thus, I can specify DECIMAL(10,2) or CHAR(5) and it means something. But Drill is a query-only engine. So, how are the types used? In general, types have to be inferred from data (or defined as casts in SQL or in views). So, we need to describe the type inference for each input source, and the semantic rules that apply when converting data inside a view or query. As an example, what happens when we convert the incompatible Parquet INTERVAL to a Drill Interval?
> General: Each issue should be discussed and resolved with development, as some of the above may be more than just a documentation issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
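
On the TIMESTAMP and DATE time-zone questions in the description, the sketch below shows why it matters whether a stored value is read as UTC or as local time. It assumes, purely for illustration, a timestamp held as milliseconds since 1970-01-01T00:00:00 UTC; whether Drill does this is what the issue asks the docs to state. The `render` helper and the two fixed offsets (+05:30 for India, -07:00 for US Pacific summer time) are hypothetical choices for the example.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def render(epoch_millis: int, tz: timezone) -> str:
    """Format a millisecond count as yyyy-MM-dd HH:mm:ss.SSS in the given zone."""
    dt = (EPOCH + timedelta(milliseconds=epoch_millis)).astimezone(tz)
    return dt.strftime("%Y-%m-%d %H:%M:%S") + f".{dt.microsecond // 1000:03d}"


# One absolute instant: midnight UTC on 2016-07-01, stored as a bare count.
stored = 1467331200000

utc_text = render(stored, timezone.utc)
india_text = render(stored, timezone(timedelta(hours=5, minutes=30)))
pacific_text = render(stored, timezone(timedelta(hours=-7)))

print(utc_text)      # rendered as UTC
print(india_text)    # same instant, India offset
print(pacific_text)  # same instant, US Pacific offset: previous calendar day
```

One stored count produces three different strings, and for the US offset a different calendar date, so the docs need to say which interpretation Drill applies.
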