spark-dev mailing list archives

From Zoltan Ivanfi ...@cloudera.com>
Subject Re: SQL TIMESTAMP semantics vs. SPARK-18350
Date Fri, 02 Jun 2017 15:33:33 GMT
Hi,

We would like to solve the problem of interoperability of existing data,
and that is the main use case for having table-level control. Spark should
be able to read timestamps written by Impala or Hive and at the same time
read back its own data. These have different semantics, so having a single
flag is not enough.
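A rough sketch of why one global flag cannot cover mixed data, using plain Python (the reader functions are hypothetical stand-ins for the engines, not real APIs; we assume, as in Parquet, that the table stores microseconds since the epoch):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# The same stored value must be interpreted per table, because each
# writer used different semantics.
stored_micros = 1_496_412_000_000_000  # 2017-06-02 14:00:00 UTC

def read_utc_normalized(micros, session_zone):
    # Spark-style: the value is an instant, rendered in the session zone.
    return datetime.fromtimestamp(micros / 1e6, tz=ZoneInfo(session_zone))

def read_timezone_agnostic(micros):
    # Impala-style: the value is a wall-clock reading; no zone applies.
    return datetime.fromtimestamp(micros / 1e6, tz=timezone.utc).replace(tzinfo=None)

print(read_utc_normalized(stored_micros, "America/Los_Angeles"))  # 2017-06-02 07:00:00-07:00
print(read_timezone_agnostic(stored_micros))                      # 2017-06-02 14:00:00
```

A per-table setting lets one query read each table with the semantics its writer used; a single global flag forces one interpretation onto both.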

Two separate types will solve this problem indeed, but only once every
component involved supports them. Unfortunately, adding these separate SQL
types is a larger effort that is only feasible in the long term and we
would like to provide a short-term solution for interoperability in the
meantime.

Br,

Zoltan

On Fri, Jun 2, 2017 at 1:32 AM Reynold Xin <rxin@databricks.com> wrote:

> Yea I don't see why this needs to be per table config. If the user wants
> to configure it per table, can't they just declare the data type on a per
> table basis, once we have separate types for timestamp w/ tz and w/o tz?
>
> On Thu, Jun 1, 2017 at 4:14 PM, Michael Allman <michael@videoamp.com>
> wrote:
>
>> I would suggest that making timestamp type behavior configurable and
>> persisted per-table could introduce some real confusion, e.g. in queries
>> involving tables with different timestamp type semantics.
>>
>> I suggest starting with the assumption that timestamp type behavior is a
>> per-session flag that can be set in a global `spark-defaults.conf` and
>> consider more granular levels of configuration as people identify solid use
>> cases.
>>
>> Cheers,
>>
>> Michael
>>
>>
>>
>> On May 30, 2017, at 7:41 AM, Zoltan Ivanfi <zi@cloudera.com> wrote:
>>
>> Hi,
>>
>> If I remember correctly, the TIMESTAMP type had UTC-normalized local time
>> semantics even before Spark 2, so I can understand that Spark considers it
>> to be the "established" behavior that must not be broken. Unfortunately,
>> this behavior does not provide interoperability with other SQL engines of
>> the Hadoop stack.
>>
>> Let me summarize the findings of this e-mail thread so far:
>>
>>    - Timezone-agnostic TIMESTAMP semantics would be beneficial for
>>    interoperability and SQL compliance.
>>    - Spark can not make a breaking change. For backward-compatibility
>>    with existing data, timestamp semantics should be user-configurable on a
>>    per-table level.
>>
>> Before going into the specifics of a possible solution, do we all agree
>> on these points?
>>
>> Thanks,
>>
>> Zoltan
>>
>> On Sat, May 27, 2017 at 8:57 PM Imran Rashid <irashid@cloudera.com>
>> wrote:
>>
>>> I had asked zoltan to bring this discussion to the dev list because I
>>> think it's a question that extends beyond a single jira (we can't figure
>>> out the semantics of timestamp in parquet if we don't know the overall goal
>>> of the timestamp type) and since it's a design question the entire community
>>> should be involved.
>>>
>>> I think that a lot of the confusion comes because we're talking about
>>> different ways time zones affect behavior: (1) parsing and (2) behavior when
>>> changing time zones for processing data.
>>>
>>> It seems we agree that spark should eventually provide a timestamp type
>>> which does conform to the standard. The question is, how do we get
>>> there? Has spark already broken compliance so much that it's impossible to
>>> go back without breaking user behavior? Or perhaps spark already has
>>> inconsistent behavior / broken compatibility within the 2.x line, so it's
>>> not unthinkable to have another breaking change?
>>>
>>> (Another part of the confusion is on me -- I believed the behavior
>>> change was in 2.2, but actually it looks like it's in 2.0.1. That changes
>>> how we think about this in context of what goes into a 2.2
>>> release. SPARK-18350 isn't the origin of the difference in behavior.)
>>>
>>> First: consider processing data that is already stored in tables, and
>>> then accessing it from machines in different time zones.  The standard is
>>> clear that "timestamp" should be just like "timestamp without time zone":
>>> it does not represent one instant in time, rather it's always displayed the
>>> same, regardless of time zone. This was the behavior in spark 2.0.0 (and
>>> 1.6), for hive tables stored as text files, and for spark's json formats.
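The distinction above between the two semantics can be sketched outside of Spark with plain Python datetimes (an illustration of the general behavior, not of Spark's internals):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# "timestamp without time zone" semantics: a wall-clock reading that is
# rendered identically no matter where it is displayed.
naive = datetime(2017, 6, 2, 12, 0, 0)
print(naive.isoformat())  # 2017-06-02T12:00:00 on every machine

# UTC-normalized (instant) semantics: the stored value is a point on the
# timeline, so its rendering shifts with the reader's time zone.
instant = datetime(2017, 6, 2, 12, 0, 0, tzinfo=timezone.utc)
print(instant.astimezone(ZoneInfo("America/Los_Angeles")).isoformat())
# 2017-06-02T05:00:00-07:00
print(instant.astimezone(ZoneInfo("Europe/Budapest")).isoformat())
# 2017-06-02T14:00:00+02:00
```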
>>>
>>> Spark 2.0.1  changed the behavior of the json format (I believe
>>> with SPARK-16216), so that it behaves more like timestamp *with* time
>>> zone.  It also makes csv behave the same (timestamp in csv was basically
>>> broken in 2.0.0).  However it did *not* change the behavior of a hive
>>> textfile; it still behaves like "timestamp with*out* time zone". Here are
>>> some experiments I tried -- there are a bunch of files there for
>>> completeness, but mostly focus on the difference between
>>> query_output_2_0_0.txt vs. query_output_2_0_1.txt:
>>>
>>> https://gist.github.com/squito/f348508ca7903ec2e1a64f4233e7aa70
>>>
>>> Given that spark has changed this behavior post 2.0.0, is it still out
>>> of the question to change this behavior to bring it back in line with the
>>> sql standard for timestamp (without time zone) in the 2.x line?  Or, as
>>> reynold proposes, is the only option at this point to add an off-by-default
>>> feature flag to get "timestamp without time zone" semantics?
>>>
>>>
>>> Second, there is the question of parsing strings into timestamp type.
>>> I'm far less knowledgeable about this, so I mostly just have questions:
>>>
>>> * does the standard dictate what the parsing behavior should be for
>>> timestamp (without time zone) when a time zone is present?
>>>
>>> * if it does and spark violates this standard is it worth trying to
>>> retain the *other* semantics of timestamp without time zone, even if we
>>> violate the parsing part?
>>>
>>> I did look at what postgres does for comparison:
>>>
>>> https://gist.github.com/squito/cb81a1bb07e8f67e9d27eaef44cc522c
>>>
>>> spark's timestamp certainly does not match postgres's timestamp for
>>> parsing; it seems closer to postgres's "timestamp with timezone" -- though
>>> I dunno if that is standard behavior at all.
>>>
>>> thanks,
>>> Imran
>>>
>>> On Fri, May 26, 2017 at 1:27 AM, Reynold Xin <rxin@databricks.com>
>>> wrote:
>>>
>>>> That's just my point 4, isn't it?
>>>>
>>>>
>>>> On Fri, May 26, 2017 at 1:07 AM, Ofir Manor <ofir.manor@equalum.io>
>>>> wrote:
>>>>
>>>>> Reynold,
>>>>> my point is that Spark should aim to follow the SQL standard instead
>>>>> of rolling its own type system.
>>>>> If I understand correctly, the existing implementation is similar to
>>>>> TIMESTAMP WITH LOCAL TIMEZONE data type in Oracle.
>>>>> In addition, there are the standard TIMESTAMP and TIMESTAMP WITH
>>>>> TIMEZONE data types which are missing from Spark.
>>>>> So, it is better (for me) if instead of extending the existing types,
>>>>> Spark would just implement the additional well-defined types properly.
>>>>> Just trying to copy-paste CREATE TABLE between SQL engines should not
>>>>> be an exercise of flags and incompatibilities.
>>>>>
>>>>> Regarding the current behaviour, if I remember correctly I had to
>>>>> force our spark O/S user into UTC so Spark won't change my timestamps.
>>>>>
>>>>> Ofir Manor
>>>>>
>>>>> Co-Founder & CTO | Equalum
>>>>>
>>>>> Mobile: +972-54-7801286 | Email: ofir.manor@equalum.io
>>>>>
>>>>> On Thu, May 25, 2017 at 1:33 PM, Reynold Xin <rxin@databricks.com>
>>>>> wrote:
>>>>>
>>>>>> Zoltan,
>>>>>>
>>>>>> Thanks for raising this again, although I'm a bit confused since I've
>>>>>> communicated with you a few times on JIRA and on private emails to explain
>>>>>> that you have some misunderstanding of the timestamp type in Spark and some
>>>>>> of your statements are wrong (e.g. the except text file part). Not sure why
>>>>>> you didn't get any of those.
>>>>>>
>>>>>>
>>>>>> Here's another try:
>>>>>>
>>>>>>
>>>>>> 1. I think you guys misunderstood the semantics of timestamp in Spark
>>>>>> before session local timezone change. IIUC, Spark has always assumed
>>>>>> timestamps to be with timezone, since it parses timestamps with timezone
>>>>>> and does all the datetime conversions with timezone in mind (it doesn't
>>>>>> ignore timezone if a timestamp string has timezone specified). The session
>>>>>> local timezone change further pushes Spark to that direction, but the
>>>>>> semantics has been with timezone before that change. Just run Spark on
>>>>>> machines with different timezone and you will know what I'm talking about.
>>>>>>
>>>>>> 2. CSV/Text is not different. The data type has always been "with
>>>>>> timezone". If you put a timezone in the timestamp string, it parses the
>>>>>> timezone.
>>>>>>
>>>>>> 3. We can't change semantics now, because it'd break all existing
>>>>>> Spark apps.
>>>>>>
>>>>>> 4. We can however introduce a new timestamp without timezone type,
>>>>>> and have a config flag to specify which one (with tz or without tz) is the
>>>>>> default behavior.
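Point 4 could look roughly like the following PySpark session sketch; the config key and value below are purely hypothetical placeholders invented for illustration, not existing Spark options:

```python
from pyspark.sql import SparkSession

# Hypothetical sketch of point 4: an off-by-default flag selecting which
# semantics the unqualified TIMESTAMP type gets. The config key below is
# invented for illustration; it is not an existing Spark option.
spark = (SparkSession.builder
         .appName("timestamp-semantics-sketch")
         .config("spark.sql.timestamp.withoutTimeZone.enabled", "true")
         .getOrCreate())
```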
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, May 24, 2017 at 5:46 PM, Zoltan Ivanfi <zi@cloudera.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Sorry if you receive this mail twice, it seems that my first attempt
>>>>>>> did not make it to the list for some reason.
>>>>>>>
>>>>>>> I would like to start a discussion about SPARK-18350
>>>>>>> <https://issues.apache.org/jira/browse/SPARK-18350> before it gets
>>>>>>> released because it seems to be going in a different direction than
>>>>>>> what other SQL engines of the Hadoop stack do.
>>>>>>>
>>>>>>> ANSI SQL defines the TIMESTAMP type (also known as TIMESTAMP WITHOUT
>>>>>>> TIME ZONE) to have timezone-agnostic semantics - basically a type that
>>>>>>> expresses readings from calendars and clocks and is unaffected by time
>>>>>>> zone. In the Hadoop stack, Impala has always worked like this and
>>>>>>> recently Presto also took steps
>>>>>>> <https://github.com/prestodb/presto/issues/7122> to become
>>>>>>> standards compliant. (Presto's design doc
>>>>>>> <https://docs.google.com/document/d/1UUDktZDx8fGwHZV4VyaEDQURorFbbg6ioeZ5KMHwoCk/edit>
>>>>>>> also contains a great summary of the different semantics.) Hive has a
>>>>>>> timezone-agnostic TIMESTAMP type as well (except for Parquet, a major
>>>>>>> source of incompatibility that is already being addressed
>>>>>>> <https://issues.apache.org/jira/browse/HIVE-12767>). A TIMESTAMP in
>>>>>>> SparkSQL, however, has UTC-normalized local time semantics (except for
>>>>>>> textfile), which is generally the semantics of the TIMESTAMP WITH TIME
>>>>>>> ZONE type.
>>>>>>>
>>>>>>> Given that timezone-agnostic TIMESTAMP semantics provide standards
>>>>>>> compliance and consistency with most SQL engines, I was wondering
>>>>>>> whether SparkSQL should also consider it in order to become ANSI SQL
>>>>>>> compliant and interoperable with other SQL engines of the Hadoop stack.
>>>>>>> Should SparkSQL adopt this semantics in the future, SPARK-18350
>>>>>>> <https://issues.apache.org/jira/browse/SPARK-18350> may turn out to
>>>>>>> be a source of problems. Please correct me if I'm wrong, but this change
>>>>>>> seems to explicitly assign TIMESTAMP WITH TIME ZONE semantics to the
>>>>>>> TIMESTAMP type. I think SPARK-18350 would be a great feature for a
>>>>>>> separate TIMESTAMP WITH TIME ZONE type, but the plain unqualified
>>>>>>> TIMESTAMP type would be better off becoming timezone-agnostic instead of
>>>>>>> gaining further timezone-aware capabilities. (Of course becoming
>>>>>>> timezone-agnostic would be a behavior change, so it must be optional and
>>>>>>> configurable by the user, as in Presto.)
>>>>>>>
>>>>>>> I would like to hear your opinions about this concern and about
>>>>>>> TIMESTAMP semantics in general. Does the community agree that a
>>>>>>> standards-compliant and interoperable TIMESTAMP type is desired? Do you
>>>>>>> perceive SPARK-18350 as a potential problem in achieving this or do I
>>>>>>> misunderstand the effects of this change?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Zoltan
>>>>>>>
>>>>>>> ---
>>>>>>>
>>>>>>> List of links in case in-line links do not work:
>>>>>>>
>>>>>>>    - SPARK-18350: https://issues.apache.org/jira/browse/SPARK-18350
>>>>>>>    - Presto's change: https://github.com/prestodb/presto/issues/7122
>>>>>>>    - Presto's design doc:
>>>>>>>    https://docs.google.com/document/d/1UUDktZDx8fGwHZV4VyaEDQURorFbbg6ioeZ5KMHwoCk/edit
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
