spark-issues mailing list archives

From "Alexandre Dupriez (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-18350) Support session local timezone
Date Thu, 12 Oct 2017 17:31:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-18350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16202322#comment-16202322 ]

Alexandre Dupriez edited comment on SPARK-18350 at 10/12/17 5:30 PM:
---------------------------------------------------------------------

Hello all,

I have a use case where a {{Dataset}} contains a column of type {{java.sql.Timestamp}} (let's
call it {{_time}}) from which I derive new columns containing the year, month, day and hour
of the {{_time}} value, with something like:
{code:java}
session.read.schema(mySchema)
            .json(path)
            .withColumn("year", year($"_time"))
            .withColumn("month", month($"_time"))
            .withColumn("day", dayofmonth($"_time"))
            .withColumn("hour", hour($"_time"))
{code}
using the standard {{year}}, {{month}}, {{dayofmonth}} and {{hour}} functions defined in {{org.apache.spark.sql.functions}}.

Now let's assume the timezone is row-dependent - and let's call {{_tz}} the column which contains
it. Because the timezone varies at the row level, I cannot configure the {{DataFrameReader}}
with a single {{timeZone}} option.
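For a dataset-wide timezone that option would be enough - a minimal sketch, assuming the JSON
source's {{timeZone}} option (a single string applied to the whole read):
{code:java}
// Works only when one timezone applies to every row of the dataset - the
// option takes a constant string, not a reference to a per-row column.
session.read.schema(mySchema)
            .option("timeZone", "UTC")
            .json(path)
{code}
but here the value has to come from the {{_tz}} column of each row.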
I wondered if something like this would be advisable:
{code:java}
session.read.schema(mySchema)
            .json(path)
            .withColumn("year", year($"_time"))
            .withColumn("month", month($"_time"))
            .withColumn("day", dayofmonth($"_time"))
            .withColumn("hour", hour($"_time", $"_tz"))
{code}
Having a look at the definition of the {{hour}} function, it delegates to an {{Hour}} expression
which can be constructed with an optional {{timeZoneId}} (see the excerpt below).
I have been trying to create an {{Hour}} expression directly, but it is a Spark-internal
construct and the API does not allow using it from user code.
I guess providing a function {{hour(t: Column, tz: Column)}} alongside the existing
{{hour(t: Column)}} would not be a satisfying design either.
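For reference, this is approximately what the relevant definitions look like in the Spark 2.2
sources (paraphrased, so take the exact signatures with a grain of salt):
{code:java}
// org.apache.spark.sql.functions - the public entry point forwards only the
// column's expression; there is no way to pass a timezone from here.
def hour(e: Column): Column = withExpr { Hour(e.expr) }

// org.apache.spark.sql.catalyst.expressions - the internal expression does
// accept a timezone, but only as an optional, constant string.
case class Hour(child: Expression, timeZoneId: Option[String] = None)
  extends UnaryExpression with TimeZoneAwareExpression with ImplicitCastInputTypes
{code}
The point being that {{timeZoneId}} is fixed at planning time and cannot vary per row.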

Do you think an elegant solution exists for this use case? Or is my approach flawed - i.e.
should I not derive the hour from a timestamp column at all when it relies on a row-dependent,
not predefined time zone like this?
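
In the meantime I fall back to a plain UDF - a minimal sketch, where {{hourInTz}} is a made-up
name and the {{Option}}-based null handling is one choice among others:
{code:java}
import java.time.ZoneId
import org.apache.spark.sql.functions.udf

// Interpret the timestamp in the row's own timezone and extract the hour.
// Option[Int] maps to a nullable integer column, so rows with a missing
// timestamp or timezone yield NULL instead of throwing.
// Note: the session-local timezone added by this ticket
// (spark.sql.session.timeZone) would not help here either, since it is
// again a single value for the whole session.
val hourInTz = udf { (ts: java.sql.Timestamp, tz: String) =>
  for {
    t <- Option(ts)
    z <- Option(tz)
  } yield t.toInstant.atZone(ZoneId.of(z)).getHour
}

session.read.schema(mySchema)
            .json(path)
            .withColumn("hour", hourInTz($"_time", $"_tz"))
{code}
The obvious downside is that a UDF is opaque to Catalyst, so this loses whatever optimizations
a native expression would get.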



> Support session local timezone
> ------------------------------
>
>                 Key: SPARK-18350
>                 URL: https://issues.apache.org/jira/browse/SPARK-18350
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Reynold Xin
>            Assignee: Takuya Ueshin
>              Labels: releasenotes
>             Fix For: 2.2.0
>
>         Attachments: sample.csv
>
>
> As of Spark 2.1, Spark SQL assumes the machine timezone for datetime manipulation, which
> is bad if users are not in the same timezones as the machines, or if different users have
> different timezones.
> We should introduce a session local timezone setting that is used for execution.
> An explicit non-goal is locale handling.



