spark-issues mailing list archives

From "Apache Spark (JIRA)" <j...@apache.org>
Subject [jira] [Assigned] (SPARK-18076) Fix default Locale used in DateFormat, NumberFormat to Locale.US
Date Mon, 24 Oct 2016 14:58:59 GMT

     [ https://issues.apache.org/jira/browse/SPARK-18076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-18076:
------------------------------------

    Assignee: Apache Spark

> Fix default Locale used in DateFormat, NumberFormat to Locale.US
> ----------------------------------------------------------------
>
>                 Key: SPARK-18076
>                 URL: https://issues.apache.org/jira/browse/SPARK-18076
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib, Spark Core, SQL
>    Affects Versions: 2.0.1
>            Reporter: Sean Owen
>            Assignee: Apache Spark
>
> Many parts of the code use {{DateFormat}} and {{NumberFormat}} instances. Although the
behavior of these formats is mostly determined by things like format strings, the exact behavior
can vary according to the platform's default locale. Although the locale defaults to "en",
it can be set to something else by environment variables. If it is, the same code can
succeed or fail based on locale alone:
> {code}
> import java.text._
> import java.util._
> def parse(s: String, l: Locale) = new SimpleDateFormat("yyyyMMMdd", l).parse(s)
> parse("1989Dec31", Locale.US)
> Sun Dec 31 00:00:00 GMT 1989
> parse("1989Dec31", Locale.UK)
> Sun Dec 31 00:00:00 GMT 1989
> parse("1989Dec31", Locale.CHINA)
> java.text.ParseException: Unparseable date: "1989Dec31"
>   at java.text.DateFormat.parse(DateFormat.java:366)
>   at .parse(<console>:18)
>   ... 32 elided
> parse("1989Dec31", Locale.GERMANY)
> java.text.ParseException: Unparseable date: "1989Dec31"
>   at java.text.DateFormat.parse(DateFormat.java:366)
>   at .parse(<console>:18)
>   ... 32 elided
> {code}
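The transcript above only exercises {{DateFormat}}. As a rough sketch of the same effect for {{NumberFormat}}, the snippet below formats and parses one value under two locales. The object and method names are made up for illustration and are not taken from the Spark code base.

```scala
import java.text.NumberFormat
import java.util.Locale

// Hypothetical demo object; not part of Spark.
object NumberLocaleDemo {
  // Format a number using the conventions of the given locale.
  def format(x: Double, l: Locale): String =
    NumberFormat.getNumberInstance(l).format(x)

  // Parse a string using the conventions of the given locale.
  def parse(s: String, l: Locale): Double =
    NumberFormat.getNumberInstance(l).parse(s).doubleValue

  def main(args: Array[String]): Unit = {
    println(format(1234.5, Locale.US))      // 1,234.5
    println(format(1234.5, Locale.GERMANY)) // 1.234,5
    println(parse("1,234.5", Locale.US))    // 1234.5
    // Under GERMANY, "," is the decimal separator, so the same input
    // does not parse to 1234.5.
    println(parse("1,234.5", Locale.GERMANY))
  }
}
```

Unlike the date case, the mismatched parse does not throw here; it returns a different number, which is arguably harder to notice.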
> Where not otherwise specified, I believe all instances in the code should default to
some fixed value, and that should probably be {{Locale.US}}. This matches the JVM's default,
and specifies both language ("en") and region ("US") to remove ambiguity. This most closely
matches what the current code behavior would be (unless the default locale was changed), because
it currently defaults to "en".
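A minimal sketch of what pinning the locale could look like, assuming the same "yyyyMMMdd" pattern as in the transcript above. The helper names ({{parseFixed}}, {{parseDefault}}, {{withDefaultLocale}}) are hypothetical and only illustrate the idea; this is not the actual patch.

```scala
import java.text.SimpleDateFormat
import java.util.{Date, Locale}

// Hypothetical helpers for illustration; not part of Spark.
object FixedLocaleParse {
  // Locale pinned explicitly: the result does not depend on the JVM default.
  def parseFixed(s: String): Date =
    new SimpleDateFormat("yyyyMMMdd", Locale.US).parse(s)

  // No locale given: SimpleDateFormat falls back to the JVM default locale.
  def parseDefault(s: String): Date =
    new SimpleDateFormat("yyyyMMMdd").parse(s)

  // Run `body` with the JVM default locale temporarily set to `l`,
  // restoring the previous default afterwards.
  def withDefaultLocale[A](l: Locale)(body: => A): A = {
    val saved = Locale.getDefault
    Locale.setDefault(l)
    try body finally Locale.setDefault(saved)
  }
}
```

With these, {{withDefaultLocale(Locale.CHINA)(parseFixed("1989Dec31"))}} still parses, while the same call with {{parseDefault}} throws the {{ParseException}} shown in the transcript.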
> This affects SQL date/time functions. At the moment, the only SQL function that lets
the user specify language/country is "sentences", which is consistent with Hive.
> It affects dates passed in the JSON API. 
> It potentially affects some strings rendered in the UI. Although this isn't a correctness
issue, there may be an argument for not letting that vary either.
> It affects a bunch of instances where dates are formatted into strings for things like
IDs or file names, which is far less likely to cause a problem, but worth making consistent.
> The other occurrences are in tests.
> The downside to this change is also its upside: the behavior doesn't depend on the default
JVM locale, but it also can't be affected by the default JVM locale. For example, if you wanted
to parse some dates in a way that depended on a non-US locale (not just the format string)
then it would no longer be possible. There's no means of specifying this, for example, in
SQL functions for parsing dates. However, controlling this by globally changing the locale
isn't exactly great either.
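One alternative to a global setting, sketched here purely as an assumption, is to take the locale as an explicit argument that defaults to {{Locale.US}}. {{parseDate}} below is a hypothetical helper, not an existing Spark or SQL function.

```scala
import java.text.SimpleDateFormat
import java.util.{Date, Locale}

// Hypothetical helper for illustration; not part of Spark.
object ExplicitLocaleParse {
  // The locale is an ordinary parameter with a fixed default, so callers
  // opt in to locale-specific parsing without touching the JVM default.
  def parseDate(s: String, pattern: String, locale: Locale = Locale.US): Date =
    new SimpleDateFormat(pattern, locale).parse(s)
}
```

For example, {{parseDate("31 December 1989", "dd MMMM yyyy")}} and {{parseDate("31 Dezember 1989", "dd MMMM yyyy", Locale.GERMANY)}} should yield the same {{Date}}, and neither call depends on how the JVM was started.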
> The purpose of this change is to make the current default behavior deterministic and
fixed. PR coming.
> CC [~hyukjin.kwon]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


