spark-issues mailing list archives

From "Yash Datta (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-23056) parse_url regression when switched to using java.net.URI instead of java.net.URL
Date Fri, 12 Jan 2018 08:48:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-23056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yash Datta updated SPARK-23056:
-------------------------------
    Labels: regression  (was: )

> parse_url regression when switched to using java.net.URI instead of java.net.URL
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-23056
>                 URL: https://issues.apache.org/jira/browse/SPARK-23056
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.3, 2.2.2, 2.3.0
>            Reporter: Yash Datta
>              Labels: regression
>
> When a URL contains an internationalized domain name, for example:
> {code:java}
> val url = "http://правительство.рф"
> {code}
> parse_url returns null, although Hive's version of parse_url handles the same input correctly.
> Digging further, the difference lies in the following call in Spark:
> {code:java}
> private def getUrl(url: UTF8String): URI = {
>   try {
>     new URI(url.toString)
>   } catch {
>     case e: URISyntaxException => null
>   }
> }
> {code}
> whereas Hive uses java.net.URL:
> {code:java}
> url = new URL(urlStr)
> {code}
> Sure enough, this simple test shows that URL works in this case but URI does not:
> {code:java}
> import java.net.{URI, URL}
>
> val url = "http://правительство.рф"
> val uriHost = new URI(url).getHost // java.net.URI finds no host here
> val urlHost = new URL(url).getHost // java.net.URL keeps the host as written
> println(s"uriHost = $uriHost") // prints uriHost = null
> println(s"urlHost = $urlHost") // prints urlHost = правительство.рф
> {code}
> To reproduce the problem in spark-sql:
> {code:java}
> spark-sql> select parse_url('http://千夏ともか.test', 'HOST');
> {code}
> This returns NULL.
> The problem was introduced by
> <https://issues.apache.org/jira/browse/SPARK-16826>, which was intended to
> improve the performance of PARSE_URL().
> The same issue affects the following SQL:
> ```SQL
> SELECT PARSE_URL('http://stanzhai.site?p=["abc"]', 'QUERY', 'p')
> -- returns NULL in Spark 2.1+
> -- returns ["abc"] in Spark versions before 2.1
> ```
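
Both failure modes quoted above can be reproduced outside Spark. What follows is only a minimal standalone sketch, not Spark's actual code path: `uriOrNull` mirrors the `getUrl` helper from the description but takes a plain String, and the names `ParseUrlSketch`, `uriOrNull` and `uriHost` are hypothetical. It assumes that the quoted `getUrl` behavior (a swallowed `URISyntaxException`, or a null host) is what surfaces as NULL in `parse_url`.

```scala
import java.net.{URI, URISyntaxException, URL}

object ParseUrlSketch {
  // Same shape as the getUrl helper quoted in the issue: a syntax error becomes null.
  def uriOrNull(url: String): URI =
    try new URI(url)
    catch { case _: URISyntaxException => null }

  // Hypothetical helper: the host as java.net.URI reports it, or null.
  def uriHost(url: String): String = {
    val uri = uriOrNull(url)
    if (uri == null) null else uri.getHost
  }

  def main(args: Array[String]): Unit = {
    val idnUrl   = "http://правительство.рф"
    val queryUrl = "http://stanzhai.site?p=[\"abc\"]"

    // Case 1: internationalized host. URI parses the string but reports no host.
    println("URI host  = " + uriHost(idnUrl))
    println("URL host  = " + new URL(idnUrl).getHost)

    // Case 2: unescaped double quotes in the query. URI rejects the whole string,
    // so the helper returns null; URL is lenient and keeps the raw query.
    println("URI       = " + uriOrNull(queryUrl))
    println("URL query = " + new URL(queryUrl).getQuery)
  }
}
```

In both cases java.net.URL accepts input that java.net.URI either rejects or parses without a host, which matches the difference between Hive's parse_url and Spark's described in the issue.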



