spark-issues mailing list archives

From "Michal Szafranski (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-22183) Inconsistency in LIKE escaping between literal values and column-based ones
Date Fri, 16 Mar 2018 15:20:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-22183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16402042#comment-16402042 ]

Michal Szafranski commented on SPARK-22183:
-------------------------------------------

As for the reporter's use case, using the explicit 'contains()' function would not just work around this issue; I would also expect it to be significantly faster. I don't think it is exposed in SQL, though:

{code:java}
sqlContext.sql("""SELECT * FROM test t""").filter($"_1".contains($"_2")).show()
{code}
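That said, a plain containment test can stay at the SQL level through the built-in {{instr}} function (1-based position of the needle, 0 when absent). A sketch of those semantics in plain Scala; the `instr` helper below only mirrors the built-in for illustration, and the SQL in the comment assumes the reporter's `test` table:

```scala
// instr(haystack, needle) in Spark SQL returns the 1-based position of
// needle in haystack, or 0 when it does not occur, so instr(...) > 0 is
// a plain "contains" test with no LIKE wildcard semantics.
// In SQL (assuming the reporter's `test` table):
//   SELECT * FROM test t WHERE instr(t._1, t._2) > 0
def instr(haystack: String, needle: String): Int =
  haystack.indexOf(needle) + 1

println(instr("aab", "a_b"))     // 0: `_` is not a wildcard here
println(instr("e\\nf", "e\\nf")) // 1: backslashes need no escaping
```

Note that, unlike LIKE, this needs no escaping at all, which is exactly why it sidesteps the issue.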

> Inconsistency in LIKE escaping between literal values and column-based ones
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-22183
>                 URL: https://issues.apache.org/jira/browse/SPARK-22183
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Adrien Lavoillotte
>            Priority: Minor
>
> I'm trying to implement auto-escaping for {{LIKE}} expressions, in order to have filters & join conditions like:
> * Column A's value contains column B's
> * Column A's value contains some literal string
> So I need to escape the {{LIKE}}-significant characters {{%}} and {{_}}. Since Spark SQL does not support {{LIKE expr ESCAPE char}}, I need to escape using \, and presumably also \ itself (twice in the case of literals, since '\\' represents a single \).
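That escaping step can be sketched as a small plain-Scala helper; the name `escapeLike` is introduced here for illustration only and is not part of any Spark API:

```scala
// Prefix each LIKE metacharacter (% and _) and the escape character \
// itself with a single backslash, as described above.
// In the regex, ([%_\\]) is a character class of %, _ and \.
def escapeLike(s: String): String =
  s.replaceAll("([%_\\\\])", "\\\\$1")

println(escapeLike("a_b"))   // a\_b
println(escapeLike("c%d"))   // c\%d
println(escapeLike("e\\nf")) // e\\nf -- the backslash itself gets doubled
```

This is the same substitution as the REGEXP_REPLACE calls in the transcripts below, just applied on the driver side.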
> But it seems that a literal in a {{LIKE}} expression does not get quite the same escaping as other string literals or non-literal {{LIKE}} expressions, seemingly depending on whether the left-hand side and/or right-hand side are literals or columns.
> Note: I'm using triple quotes below to avoid Scala-level \ escaping. And in the body of this description, I'm purposely using zero-width spaces to keep Jira from transforming my \.
> On Spark 2.2.0:
> {code}
> // both LHS & RHS literals
> scala> spark.sql("""SELECT '\\', '\\' LIKE '\\\\'""").show()
> +---+---------+
> |  \|\ LIKE \\|
> +---+---------+
> |  \|     true|
> +---+---------+
> scala> spark.sql("""SELECT '\\', '\\' LIKE '\\'""").show()
> org.apache.spark.sql.AnalysisException: the pattern '\' is invalid, it is not allowed to end with the escape character;
>   at org.apache.spark.sql.catalyst.util.StringUtils$.fail$1(StringUtils.scala:42)
>   at org.apache.spark.sql.catalyst.util.StringUtils$.escapeLikeRegex(StringUtils.scala:53)
>   at org.apache.spark.sql.catalyst.expressions.Like.escape(regexpExpressions.scala:105)
>   at org.apache.spark.sql.catalyst.expressions.StringRegexExpression.compile(regexpExpressions.scala:50)
>   at org.apache.spark.sql.catalyst.expressions.StringRegexExpression.pattern(regexpExpressions.scala:53)
>   at org.apache.spark.sql.catalyst.expressions.StringRegexExpression.nullSafeEval(regexpExpressions.scala:56)
>   at org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:419)
>   ...
> scala> spark.sql("""SELECT 'a\\b', 'a\\b' LIKE 'a\\\\b'""").show()
> +---+-------------+
> |a\b|a\b LIKE a\\b|
> +---+-------------+
> |a\b|         true|
> +---+-------------+
> scala> spark.sql("""SELECT 'a\\b', 'a\\b' LIKE 'a\\b'""").show()
> org.apache.spark.sql.AnalysisException: the pattern 'a\b' is invalid, the escape character is not allowed to precede 'b';
>   at org.apache.spark.sql.catalyst.util.StringUtils$.fail$1(StringUtils.scala:42)
>   at org.apache.spark.sql.catalyst.util.StringUtils$.escapeLikeRegex(StringUtils.scala:51)
>   at org.apache.spark.sql.catalyst.expressions.Like.escape(regexpExpressions.scala:105)
>   ...
> // test data
> spark.sql("""SELECT * FROM test""").show()
> +----+----+
> |  _1|  _2|
> +----+----+
> |  ok|  ok|
> |  Ok|  ok|
> | a_b| a_b|
> | aab| a_b|
> | c%d| c%d|
> |caad| c%d|
> |e\nf|e\nf|
> | e
> f|e\nf|
> +----+----+
> // both column-based
> // not escaping \
> scala> spark.sql("""SELECT * FROM test t WHERE `_1` LIKE REGEXP_REPLACE(`_2`, '([%_])', '\\\\$1')""").show()
> ERROR executor.Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> org.apache.spark.sql.AnalysisException: the pattern 'e\nf' is invalid, the escape character is not allowed to precede 'n';
> 	at org.apache.spark.sql.catalyst.util.StringUtils$.fail$1(StringUtils.scala:42)
> 	at org.apache.spark.sql.catalyst.util.StringUtils$.escapeLikeRegex(StringUtils.scala:51)
> 	at org.apache.spark.sql.catalyst.util.StringUtils.escapeLikeRegex(StringUtils.scala)
> 	...
> // escaping \
> scala> spark.sql("""SELECT * FROM test t WHERE `_1` LIKE REGEXP_REPLACE(`_2`, '([%_\\\\])', '\\\\$1')""").show()
> +----+----+
> |  _1|  _2|
> +----+----+
> |  ok|  ok|
> | a_b| a_b|
> | c%d| c%d|
> |e\nf|e\nf|
> +----+----+
> // LHS column-based, RHS literal
> scala> spark.sql("""SELECT * FROM test t WHERE `_1` LIKE 'e\\nf'""").show()
> +----+----+
> |  _1|  _2|
> +----+----+
> |e\nf|e\nf|
> +----+----+
> scala> spark.sql("""SELECT * FROM test t WHERE `_1` LIKE 'e\\\\nf'""").show()
> +---+---+
> | _1| _2|
> +---+---+
> +---+---+
> {code}
> So in Spark 2.2:
> * Escaping \ in the RHS is mandatory if the RHS is not a literal; otherwise we get an error. So far so good.
> * If LHS is also a literal, same applies.
> * If the RHS is a literal and the LHS is a column, it seems that the string-literal escaping of \ interferes with the {{LIKE}} escaping, since re-escaping \ no longer matches. I would expect to need \\\\ between quotes to match a single \ (one escape for the string literal, one for the {{LIKE}} escaping).
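The 2.2 errors above come from the strict pattern validation in org.apache.spark.sql.catalyst.util.StringUtils.escapeLikeRegex. A simplified re-implementation of that rule, modeled only on the behavior shown in the stack traces rather than copied from the Spark source:

```scala
import java.util.regex.Pattern

// Simplified sketch of the strict LIKE-to-regex translation applied AFTER
// string-literal parsing: \ may only precede %, _ or another \. A \ before
// any other character, or a trailing \, is rejected, matching the error
// messages quoted above.
def likeToRegex(pattern: String): String = {
  val out = new StringBuilder("(?s)")
  val it = pattern.iterator
  while (it.hasNext) {
    it.next() match {
      case '\\' =>
        if (!it.hasNext)
          sys.error(s"the pattern '$pattern' is invalid, it is not allowed to end with the escape character")
        it.next() match {
          case c @ ('%' | '_' | '\\') => out.append(Pattern.quote(c.toString))
          case c => sys.error(s"the pattern '$pattern' is invalid, the escape character is not allowed to precede '$c'")
        }
      case '%' => out.append(".*")
      case '_' => out.append(".")
      case c   => out.append(Pattern.quote(c.toString))
    }
  }
  out.result()
}

// The two-character pattern \\ (from the SQL literal '\\\\') matches one \:
println("\\".matches(likeToRegex("\\\\")))             // true
// The pattern a\b (from the SQL literal 'a\\b') is rejected, as reported:
println(scala.util.Try(likeToRegex("a\\b")).isFailure) // true
```

Under this rule, the only valid way for the pattern (post literal parsing) to contain a backslash is as the pair \\, which is why the unescaped column-based variant fails on e\nf.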
> On Spark 1.6 (and, it seems, up to but not including Spark 2.2.0):
> {code}
> // both LHS & RHS literals
> scala> sqlContext.sql("""SELECT '\\', '\\' LIKE '\\\\'""").show()
> +---+-----+
> |_c0|  _c1|
> +---+-----+
> |  \|false|
> +---+-----+
> scala> sqlContext.sql("""SELECT '\\', '\\' LIKE '\\'""").show()
> +---+-----+
> |_c0|  _c1|
> +---+-----+
> |  \|false|
> +---+-----+
> scala> sqlContext.sql("""SELECT 'a\\b', 'a\\b' LIKE 'a\\b'""").show()
> +---+----+
> |_c0| _c1|
> +---+----+
> |a\b|true|
> +---+----+
> scala> sqlContext.sql("""SELECT 'a\\b', 'a\\b' LIKE 'a\\\\b'""").show()
> +---+----+
> |_c0| _c1|
> +---+----+
> |a\b|true|
> +---+----+
> // Same test data as for 2.2
> // both column-based
> // not escaping \
> scala> sqlContext.sql("""SELECT * FROM test t WHERE `_1` LIKE REGEXP_REPLACE(`_2`, '([%_])', '\\\\$1')""").show()
> +----+----+
> |  _1|  _2|
> +----+----+
> |  ok|  ok|
> | a_b| a_b|
> | c%d| c%d|
> |e\nf|e\nf|
> +----+----+
> // escaping \
> scala> sqlContext.sql("""SELECT * FROM test t WHERE `_1` LIKE REGEXP_REPLACE(`_2`, '([%_\\\\])', '\\\\$1')""").show()
> +----+----+
> |  _1|  _2|
> +----+----+
> |  ok|  ok|
> | a_b| a_b|
> | c%d| c%d|
> |e\nf|e\nf|
> +----+----+
> // LHS column-based, RHS literal
> scala> sqlContext.sql("""SELECT * FROM test t WHERE `_1` LIKE 'e\\nf'""").show()
> +----+----+
> |  _1|  _2|
> +----+----+
> |e\nf|e\nf|
> +----+----+
> scala> sqlContext.sql("""SELECT * FROM test t WHERE `_1` LIKE 'e\\\\nf'""").show()
> +---+---+
> | _1| _2|
> +---+---+
> +---+---+
> {code}
> So in Spark 1.6:
> * A single-character literal string \ cannot be matched at all.
> * Matching a string (literal or column) containing \ works whether you escape the \ or not, which makes me think there is room for ambiguity (would \\n match \n?).
> * If RHS is a literal and LHS is a column, same issue as Spark 2.2



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

