spark-dev mailing list archives

From Reynold Xin <>
Subject Re: SQL TIMESTAMP semantics vs. SPARK-18350
Date Fri, 26 May 2017 06:27:21 GMT
That's just my point 4, isn't it?

On Fri, May 26, 2017 at 1:07 AM, Ofir Manor <> wrote:

> Reynold,
> my point is that Spark should aim to follow the SQL standard instead of
> rolling its own type system.
> If I understand correctly, the existing implementation is similar to
> In addition, there are the standard TIMESTAMP and TIMESTAMP WITH TIMEZONE
> data types which are missing from Spark.
> So, it is better (for me) if instead of extending the existing types,
> Spark would just implement the additional well-defined types properly.
> Just trying to copy-paste CREATE TABLE between SQL engines should not be
> an exercise in flags and incompatibilities.
> Regarding the current behaviour, if I remember correctly I had to force
> our Spark O/S user into UTC so that Spark wouldn't change my timestamps.
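A common way to apply the workaround Ofir describes is to pin the JVM timezone for both driver and executors via Spark conf; a sketch (the conf keys are standard Spark options, the choice of UTC is the assumption here):

```shell
# Pin Spark's JVM timezone to UTC on driver and executors so that
# timestamp interpretation does not depend on the machine's O/S timezone.
spark-submit \
  --conf "spark.driver.extraJavaOptions=-Duser.timezone=UTC" \
  --conf "spark.executor.extraJavaOptions=-Duser.timezone=UTC" \
  app.jar
```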
> Ofir Manor
> Co-Founder & CTO | Equalum
> Mobile: +972-54-7801286 | Email:
> On Thu, May 25, 2017 at 1:33 PM, Reynold Xin <> wrote:
>> Zoltan,
>> Thanks for raising this again, although I'm a bit confused since I've
>> communicated with you a few times on JIRA and on private emails to explain
>> that you have some misunderstanding of the timestamp type in Spark and some
>> of your statements are wrong (e.g. the except text file part). Not sure why
>> you didn't get any of those.
>> Here's another try:
>> 1. I think you guys misunderstood the semantics of timestamp in Spark
>> before the session local timezone change. IIUC, Spark has always assumed
>> timestamps to be with timezone, since it parses timestamps with timezone
>> and does all the datetime conversions with timezone in mind (it doesn't
>> ignore timezone if a timestamp string has timezone specified). The session
>> local timezone change further pushes Spark to that direction, but the
>> semantics were with-timezone even before that change. Just run Spark on
>> machines with different timezones and you will see what I'm talking about.
>> 2. CSV/Text is no different. The data type has always been "with
>> timezone". If you put a timezone in the timestamp string, it parses the
>> timezone.
>> 3. We can't change semantics now, because it'd break all existing Spark
>> apps.
>> 4. We can however introduce a new timestamp without timezone type, and
>> have a config flag to specify which one (with tz or without tz) is the
>> default behavior.
>> On Wed, May 24, 2017 at 5:46 PM, Zoltan Ivanfi <> wrote:
>>> Hi,
>>> Sorry if you receive this mail twice, it seems that my first attempt did
>>> not make it to the list for some reason.
>>> I would like to start a discussion about SPARK-18350 before it gets
>>> released because it seems to be going in a different direction than what
>>> other SQL engines of the Hadoop stack do.
>>> ANSI SQL defines the TIMESTAMP type (also known as TIMESTAMP WITHOUT
>>> TIME ZONE) to have timezone-agnostic semantics - basically a type that
>>> expresses readings from calendars and clocks and is unaffected by time
>>> zone. In the Hadoop stack, Impala has always worked like this and recently
>>> Presto also took steps to become standards compliant. (Presto's design doc
>>> also contains a great summary of the different semantics.) Hive has a
>>> timezone-agnostic TIMESTAMP type as well (except for Parquet, a major
>>> source of incompatibility that is already being addressed). A TIMESTAMP in
>>> SparkSQL, however, has UTC-normalized local time semantics (except for
>>> textfile), which is generally the semantics of the TIMESTAMP WITH TIME ZONE
>>> type.
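The two semantics just described can be made concrete with a toy model in plain Python (the helper names are mine, not any engine's API): UTC-normalized storage shifts the wall clock when writer and reader sit in different zones, while a timezone-agnostic value would round-trip unchanged.

```python
# Toy model (plain Python; helper names are illustrative, not engine APIs)
# of the contrast above: UTC-normalized local time vs. timezone-agnostic.
from datetime import datetime
from zoneinfo import ZoneInfo

def store_utc_normalized(text: str, writer_tz: str) -> datetime:
    # UTC-normalized local time semantics: interpret the literal in the
    # writer's zone and normalize the stored value to UTC.
    local = datetime.fromisoformat(text).replace(tzinfo=ZoneInfo(writer_tz))
    return local.astimezone(ZoneInfo("UTC"))

def load_utc_normalized(stored: datetime, reader_tz: str) -> str:
    # Reading renders the stored instant in the reader's zone.
    return stored.astimezone(ZoneInfo(reader_tz)).strftime("%Y-%m-%d %H:%M:%S")

wall = "2017-01-01 12:00:00"
stored = store_utc_normalized(wall, "Europe/Budapest")  # writer at UTC+1
print(load_utc_normalized(stored, "America/New_York"))  # 2017-01-01 06:00:00
# A timezone-agnostic TIMESTAMP would simply hand back the original reading:
print(wall)                                             # 2017-01-01 12:00:00
```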
>>> Given that timezone-agnostic TIMESTAMP semantics provide standards
>>> compliance and consistency with most SQL engines, I was wondering whether
>>> SparkSQL should also consider it in order to become ANSI SQL compliant and
>>> interoperable with other SQL engines of the Hadoop stack. Should SparkSQL
>>> adopt these semantics in the future, SPARK-18350 may turn out to be
>>> a source of problems. Please correct me if I'm wrong, but this change seems
>>> to explicitly assign TIMESTAMP WITH TIME ZONE semantics to the TIMESTAMP
>>> type. I think SPARK-18350 would be a great feature for a separate TIMESTAMP
>>> WITH TIME ZONE type, but the plain unqualified TIMESTAMP type would be
>>> better becoming timezone-agnostic instead of gaining further timezone-aware
>>> capabilities. (Of course becoming timezone-agnostic would be a behavior
>>> change, so it must be optional and configurable by the user, as in Presto.)
>>> I would like to hear your opinions about this concern and about
>>> TIMESTAMP semantics in general. Does the community agree that a
>>> standards-compliant and interoperable TIMESTAMP type is desired? Do you
>>> perceive SPARK-18350 as a potential problem in achieving this or do I
>>> misunderstand the effects of this change?
>>> Thanks,
>>> Zoltan
>>> ---
>>> List of links in case in-line links do not work:
>>>    - SPARK-18350:
>>>    - Presto's change:
>>>    - Presto's design doc:
>>>      https://docs.google.com/document/d/1UUDktZDx8fGwHZV4VyaEDQURorFbbg6ioeZ5KMHwoCk/edit
