drill-issues mailing list archives

From "Paul Rogers (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-4765) Missing, incorrect information in Drill data types page
Date Wed, 06 Jul 2016 22:00:12 GMT

    [ https://issues.apache.org/jira/browse/DRILL-4765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15365189#comment-15365189
] 

Paul Rogers commented on DRILL-4765:
------------------------------------

Suggestion on solution: for Parquet, the Date logical type declares the meaning of the int32
field. The Parquet reader should do a Parquet-to-Drill conversion step for each field. For
dates, that conversion means changing the Parquet date representation to Drill's (by converting
units and/or the 0-point).

Ideally, the solution should be generic so that it will also work for Parquet interval types,
and for other Parquet logical types, in the future. That is, maybe a table of conversion
functions, keyed by Parquet logical type.

An advantage of this approach is that we can then easily support the "no-op" Parquet logical
types of int_8, int_16, int_32, uint_8, and uint_16.
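
A minimal sketch of such a table, in Python for brevity (Drill itself is Java). Parquet's
DATE is an int32 count of days since 1970-01-01; the target representation used here
(milliseconds since the same epoch) is an assumption for illustration only, and the names
CONVERTERS, convert and parquet_date_to_millis are hypothetical, not Drill APIs.

```python
MILLIS_PER_DAY = 86_400_000


def parquet_date_to_millis(days):
    """Convert a Parquet DATE (days since 1970-01-01) to milliseconds.

    The millisecond target unit is an assumption made for this sketch.
    """
    return days * MILLIS_PER_DAY


def identity(value):
    """No-op conversion: the int32 value is already usable as-is."""
    return value


# Conversion functions keyed by Parquet logical type name.
CONVERTERS = {
    "DATE": parquet_date_to_millis,
    "INT_8": identity,
    "INT_16": identity,
    "INT_32": identity,
    "UINT_8": identity,
    "UINT_16": identity,
}


def convert(logical_type, raw):
    """Look up the conversion for a logical type and apply it to a raw value."""
    try:
        converter = CONVERTERS[logical_type]
    except KeyError:
        raise ValueError(
            f"no conversion registered for Parquet logical type {logical_type!r}"
        ) from None
    return converter(raw)
```

With this shape, supporting a new logical type (an interval, say) would mean adding one
entry to the table rather than touching the reader loop.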

> Missing, incorrect information in Drill data types page
> -------------------------------------------------------
>
>                 Key: DRILL-4765
>                 URL: https://issues.apache.org/jira/browse/DRILL-4765
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Documentation
>    Affects Versions: 1.6.0
>            Reporter: Paul Rogers
>            Assignee: Bridget Bevens
>
> Consider the Drill Supported Types page: https://drill.apache.org/docs/supported-data-types/
> A number of issues can be seen.
> For BIGINT, it would be clearer to express the range as: -2^63 to 2^63-1.
> For INTEGER, it would be clearer to express the range as: -2^31 to 2^31-1.
> DATE: The statement "in YYYY-MM-DD format" is wrong. The internal representation has
no format, it is just a number representing the day count. The format is applied only on output
and varies depending on the tool used. Perhaps for the Drill web UI it is in ISO format.
> DATE: Presumably the date is not time-zone specific. That is, 2016-07-01 is the first
of July in both the US and India, though a given absolute time may be on two different dates
in these locations.
> DATE: We use 4713 BC as a 0-point. But, the calendar system has changed many times since
that date. (Indeed, the current system did not even exist on that date.) Is this a simple
projection of the current system back in time, or does it adjust for the discontinuities in
the Gregorian calendar? This should be stated as it is important for any data files that contain
historical dates. (And is why choosing a 20th-century 0-point would have been better...)
> FLOAT, DOUBLE: presumably these are in the standard IEEE 754 format? If so,
let's state that.
> INTERVAL: there are many ways that intervals have been represented in DB systems. Parquet
represents data as a triple: months, days and (milli)seconds. Does Drill use a similar format?
If not, what is the format? A normal DB can declare the interval as part of the data declaration.
How does Drill infer the format? How does the user access the parts of the range?
> INTERVAL: the footnote says, "Internally, INTERVAL is represented as INTERVALDAY or INTERVALYEAR."
But, if so, then INTERVAL can't represent a time interval: a serious limitation. Also, we
can't convert a Parquet Interval to a Drill interval since there is no mapping to Drill that
includes months, days and seconds. This is a huge limitation and should be explained.
> SMALLINT: This is a supported types table, but the footnotes say SMALLINT is not supported.
We also do not list the many internal Value Vector types we don't support (int8, uint8, int16,
uint16, uint32 and so on.) Should we list SMALLINT if we don't actually support it?
> TIME: the format is actually the number of seconds since 2001-01-01. The "24-hour based time
... in hours, minutes, seconds format" confuses display format with internal representation.
See DATE above.
> TIME: Presumably the time is in local time, not UTC. That is, the time is 12:34:56 with
the time zone left unspecified.
> TIME: The example for TIME is, "22:55:55.23". But, note that the example shows milliseconds,
but the description says the time unit is seconds. Which is right?
> TIME: The example shows just a time (seconds since midnight), but the description says
that this is a timestamp: number of seconds since 2001-01-01. If so, then is TIME like TIMESTAMP
(with a different basis)? Or is it really a time-only value (so that the description is wrong)?
> TIMESTAMP: The description says "JDBC timestamp", but this is not accurate. JDBC is a
layer on top of a DB. So, we could say, "JDBC timestamp format".
> TIMESTAMP: Explain the basis. A JDBC timestamp (https://docs.oracle.com/javase/8/docs/api/java/sql/Timestamp.html)
expresses time in nanoseconds since 1970-01-01. So, does Drill also have nanosecond precision?
The docs say, "optional milliseconds", so presumably Drill only keeps milliseconds. As a result,
the Drill timestamp is NOT a JDBC timestamp.
> TIMESTAMP: JDBC timestamps are vague. They are based on a Java Date which is defined
as milliseconds since 1970-01-01T00:00:00 UTC. But, it seems a JDBC timestamp is local (it
has no implied timezone). Does Drill assume that a TIMESTAMP is UTC (like java.util.Date)
or local (like java.sql.Date)?
> TIMESTAMP & DATE/TIME: We've created an incompatibility between the date & time
format on the one hand, and TIMESTAMP on the other. We should explain how to convert between
the two since it is non-obvious how this would be done without noodling out the conversion
factor. Or, is a Drill timestamp also based on 2001-01-01 like a Drill date?
> TIMESTAMP: "format: yyyy-MM-dd HH:mm:ss.SSS". Again, the timestamp does not have a format,
it is just a count of millis (or nanos, see above.) As explained above, formatting is done
by each tool and can be whatever the user wants.
> CHAR: The description says, "The default limit is 1 character. The maximum character
limit is 2,147,483,647." which seems to apply to the CHAR type: CHAR(1) to CHAR(2,147,483,647).
But, the footnote says, "Currently, Drill supports only variable-length strings." So, the
"default limit..." stuff does not actually apply, does it?
> General: In a full DB, types are important because the user declares columns of the given
type. Thus, I can specify DECIMAL(10,2) or CHAR(5) and it means something. But, Drill is a
query-only engine. So, how are the types used? In general, types have to be inferred from
data (or defined as casts in SQL or in views). So, we need to describe the type inference
for each input source. And, the semantic rules that apply when converting data inside a view
or query. As an example, what happens when we convert the incompatible Parquet INTERVAL to
a Drill Interval?
> General: Each issue should be discussed and resolved with development as some of the
above may be more than just a documentation issue.
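
To illustrate the DATE and TIMESTAMP points in the issue description above, here is a small
Python sketch of the difference between a stored count and its display format, and of the
precision lost when nanoseconds are reduced to milliseconds. The 1970-01-01 zero point and
the helper names are assumptions chosen for illustration; they are not a statement of what
Drill actually stores.

```python
from datetime import date, timedelta

# Assumed zero point for this sketch only; Drill's actual 0-point may differ.
EPOCH = date(1970, 1, 1)


def date_from_day_count(days):
    """Turn a stored day count into a calendar date (proleptic Gregorian)."""
    return EPOCH + timedelta(days=days)


def nanos_to_millis(nanos):
    """Reduce a nanosecond count to milliseconds, flooring the remainder."""
    return nanos // 1_000_000


# The stored value is just a number; it has no format of its own.
stored_days = (date(2016, 7, 6) - EPOCH).days
d = date_from_day_count(stored_days)

# The same stored value rendered two ways by two hypothetical tools.
iso_text = d.isoformat()            # ISO 8601 style
us_text = d.strftime("%m/%d/%Y")    # US style

# A nanosecond timestamp cannot survive a round trip through milliseconds.
nanos = 1_467_842_412_123_456_789
millis = nanos_to_millis(nanos)
lost = nanos - millis * 1_000_000   # sub-millisecond part that was dropped
```

Both strings describe the same stored number, which is why the documentation should
describe the representation and leave formatting to the client tool.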



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
