spark-dev mailing list archives

From Reynold Xin <r...@databricks.com>
Subject Re: SQL TIMESTAMP semantics vs. SPARK-18350
Date Fri, 26 May 2017 06:27:21 GMT
That's just my point 4, isn't it?


On Fri, May 26, 2017 at 1:07 AM, Ofir Manor <ofir.manor@equalum.io> wrote:

> Reynold,
> my point is that Spark should aim to follow the SQL standard instead of
> rolling its own type system.
> If I understand correctly, the existing implementation is similar to
> the TIMESTAMP WITH LOCAL TIME ZONE data type in Oracle.
> In addition, there are the standard TIMESTAMP and TIMESTAMP WITH TIME ZONE
> data types, which are missing from Spark.
> So, it is better (for me) if instead of extending the existing types,
> Spark would just implement the additional well-defined types properly.
> Just trying to copy-paste CREATE TABLE between SQL engines should not be
> an exercise of flags and incompatibilities.
>
> Regarding the current behaviour, if I remember correctly I had to force
> our Spark OS user into UTC so Spark won't change my timestamps.
>
> Ofir Manor
>
> Co-Founder & CTO | Equalum
>
> Mobile: +972-54-7801286 | Email: ofir.manor@equalum.io
>
> On Thu, May 25, 2017 at 1:33 PM, Reynold Xin <rxin@databricks.com> wrote:
>
>> Zoltan,
>>
>> Thanks for raising this again, although I'm a bit confused since I've
>> communicated with you a few times on JIRA and in private emails to explain
>> that you have some misunderstanding of the timestamp type in Spark and some
>> of your statements are wrong (e.g. the "except for textfile" part). Not sure
>> why you didn't get any of those.
>>
>>
>> Here's another try:
>>
>>
>> 1. I think you guys misunderstood the semantics of timestamps in Spark
>> before the session-local timezone change. IIUC, Spark has always assumed
>> timestamps to be with timezone, since it parses timestamps with timezone
>> and does all the datetime conversions with timezone in mind (it doesn't
>> ignore the timezone if a timestamp string has one specified). The
>> session-local timezone change pushes Spark further in that direction, but
>> the semantics were "with timezone" before that change. Just run Spark on
>> machines with different timezones and you will see what I'm talking about.
>>
>> 2. CSV/Text is not different. The data type has always been "with
>> timezone". If you put a timezone in the timestamp string, it parses the
>> timezone.
>>
>> 3. We can't change semantics now, because it'd break all existing Spark
>> apps.
>>
>> 4. We can however introduce a new timestamp without timezone type, and
>> have a config flag to specify which one (with tz or without tz) is the
>> default behavior.
>>
>>
>>
>> On Wed, May 24, 2017 at 5:46 PM, Zoltan Ivanfi <zi@cloudera.com> wrote:
>>
>>> Hi,
>>>
>>> Sorry if you receive this mail twice, it seems that my first attempt did
>>> not make it to the list for some reason.
>>>
>>> I would like to start a discussion about SPARK-18350
>>> <https://issues.apache.org/jira/browse/SPARK-18350> before it gets
>>> released because it seems to be going in a different direction than what
>>> other SQL engines of the Hadoop stack do.
>>>
>>> ANSI SQL defines the TIMESTAMP type (also known as TIMESTAMP WITHOUT
>>> TIME ZONE) to have timezone-agnostic semantics - basically a type that
>>> expresses readings from calendars and clocks and is unaffected by time
>>> zone. In the Hadoop stack, Impala has always worked like this and recently
>>> Presto also took steps <https://github.com/prestodb/presto/issues/7122>
>>> to become standards compliant. (Presto's design doc
>>> <https://docs.google.com/document/d/1UUDktZDx8fGwHZV4VyaEDQURorFbbg6ioeZ5KMHwoCk/edit>
>>> also contains a great summary of the different semantics.) Hive has a
>>> timezone-agnostic TIMESTAMP type as well (except for Parquet, a major
>>> source of incompatibility that is already being addressed
>>> <https://issues.apache.org/jira/browse/HIVE-12767>). A TIMESTAMP in
>>> SparkSQL, however, has UTC-normalized local time semantics (except for
>>> textfile), which is generally the semantics of the TIMESTAMP WITH TIME ZONE
>>> type.
>>>
>>> Given that timezone-agnostic TIMESTAMP semantics provide standards
>>> compliance and consistency with most SQL engines, I was wondering whether
>>> SparkSQL should also consider it in order to become ANSI SQL compliant and
>>> interoperable with other SQL engines of the Hadoop stack. Should SparkSQL
>>> adopt these semantics in the future, SPARK-18350
>>> <https://issues.apache.org/jira/browse/SPARK-18350> may turn out to be
>>> a source of problems. Please correct me if I'm wrong, but this change seems
>>> to explicitly assign TIMESTAMP WITH TIME ZONE semantics to the TIMESTAMP
>>> type. I think SPARK-18350 would be a great feature for a separate TIMESTAMP
>>> WITH TIME ZONE type, but the plain unqualified TIMESTAMP type would be
>>> better off becoming timezone-agnostic instead of gaining further timezone-aware
>>> capabilities. (Of course becoming timezone-agnostic would be a behavior
>>> change, so it must be optional and configurable by the user, as in Presto.)
>>>
>>> I would like to hear your opinions about this concern and about
>>> TIMESTAMP semantics in general. Does the community agree that a
>>> standards-compliant and interoperable TIMESTAMP type is desired? Do you
>>> perceive SPARK-18350 as a potential problem in achieving this or do I
>>> misunderstand the effects of this change?
>>>
>>> Thanks,
>>>
>>> Zoltan
>>>
>>> ---
>>>
>>> List of links in case in-line links do not work:
>>>
>>>    - SPARK-18350: https://issues.apache.org/jira/browse/SPARK-18350
>>>    - Presto's change: https://github.com/prestodb/presto/issues/7122
>>>    - Presto's design doc:
>>>      https://docs.google.com/document/d/1UUDktZDx8fGwHZV4VyaEDQURorFbbg6ioeZ5KMHwoCk/edit
>>>
>>>
>>>
>>
>
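
For readers following the thread, the two semantics under discussion can be sketched in plain Python. This is only an illustration and not Spark code: the function names, the fixed UTC offsets standing in for two session timezones, and the sample value are all assumptions made for the example. "UTC-normalized" here means the written wall-clock string is interpreted in the writer's zone and stored as an instant; "timezone-agnostic" means the wall-clock reading is stored as-is.

```python
from datetime import datetime, timedelta, timezone

FMT = "%Y-%m-%d %H:%M:%S"

# Hypothetical session zones, modelled as fixed offsets (no DST handling).
WRITER_TZ = timezone(timedelta(hours=2))   # assumed zone of the writing session
READER_TZ = timezone(timedelta(hours=-7))  # assumed zone of the reading session


def store_instant(wall_clock: str, session_tz: timezone) -> datetime:
    """UTC-normalized semantics: read the string in the session zone, keep the instant."""
    local = datetime.strptime(wall_clock, FMT).replace(tzinfo=session_tz)
    return local.astimezone(timezone.utc)


def render_instant(instant: datetime, session_tz: timezone) -> str:
    """Show a stored instant as wall-clock time in the reader's session zone."""
    return instant.astimezone(session_tz).strftime(FMT)


def store_wall_clock(wall_clock: str) -> datetime:
    """Timezone-agnostic semantics: keep the calendar/clock reading, no zone attached."""
    return datetime.strptime(wall_clock, FMT)


def render_wall_clock(value: datetime) -> str:
    """A timezone-agnostic value renders the same in every session."""
    return value.strftime(FMT)


def parse_with_optional_offset(text: str, session_tz: timezone) -> datetime:
    """If the string carries an offset, honour it; otherwise assume the session zone."""
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=session_tz)
    return parsed.astimezone(timezone.utc)


written = "2017-05-24 17:46:00"
print(render_instant(store_instant(written, WRITER_TZ), READER_TZ))
print(render_wall_clock(store_wall_clock(written)))
```

Under these assumptions the first print shows a different wall-clock time than was written, because the reader sits in another zone, while the second reproduces the written string unchanged. The last helper mirrors the point that a timezone embedded in the timestamp string is parsed rather than ignored.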
