impala-dev mailing list archives

From Jin Chul Kim <jinc...@gmail.com>
Subject Re: [DISCUSS: IMPALA-3282] Character escapes in regular expressions
Date Thu, 21 Dec 2017 04:53:39 GMT
Hi,

I've pushed an initial change: https://gerrit.cloudera.org/#/c/8900/
The change contains only the essential functionality:
- Function name: regexp_escape
- Takes a string as its input parameter and returns the escaped string.
- Escapes the following special characters: ".*\\+?^[](){}$!=:-#\n\r\t\v "
(The double quotes are not part of the set; they are there only so that the
trailing space is visible.)
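
To make the behaviour above concrete, here is a minimal Python sketch. It is
not the code in the Gerrit change: it assumes that "escaping" simply means
prefixing each listed character with a backslash, and how the change itself
renders \n, \r, \t and \v is not stated here, so treat that part as a guess.

```python
import re

# Character set copied from the list above; the trailing space is included.
SPECIAL_CHARS = set(".*\\+?^[](){}$!=:-#\n\r\t\v ")


def regexp_escape(text: str) -> str:
    """Return text with every character in SPECIAL_CHARS backslash-prefixed.

    Assumption: a plain backslash prefix is used for all characters,
    including the whitespace ones.
    """
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in text)


escaped = regexp_escape("a.b*c")  # a\.b\*c
# The escaped pattern matches only the literal input string.
assert re.fullmatch(escaped, "a.b*c")
assert not re.fullmatch(escaped, "aXbbc")
```

Used as a pattern, the escaped string no longer treats "." and "*" as
metacharacters, which is the point of the function.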

Best regards,
Jinchul

2017-12-19 11:12 GMT+09:00 Jin Chul Kim <jinchul@gmail.com>:

> Hi,
>
> I would like to discuss some issues before taking the ticket, which asks
> for a new builtin function (e.g. string regex_escape(string_pattern)). The
> purpose of the function is to escape a set of special characters by
> replacing each of them in the string pattern with its escaped form.
>
> 1. Define the set of characters to escape
> When I researched escaping in other languages, I found some interesting
> differences between them.
>
> We should decide on our own set of escaped characters. Here is a summary
> of what other languages do:
>
> - Perl: Escapes every character that is not a word character (i.e. not in
> [A-Za-z_0-9]).
> - PHP: Escapes the following special characters: . \ + * ? [ ^ ] $ ( ) { }
> = ! < > | : -
> - Python: Same as Perl's approach, but the character underscore is no
> longer escaped since version 3.3.
> - Ruby: Escapes the following special characters: [ ] { } ( ) | - * . \ ?
> + ^ $ #
> Ruby escapes the comment character (#), but does not escape the
> context-sensitive characters (: <).
> - Java: A different approach. Java relies on "as if it were a literal
> pattern" by "\Q" and "\E"
> - C#: Escapes the following special characters: \ * + ? | { [ ( ) ^ $ . #
> whitespace
> C# does not escape ] and }.
>
> See the discussion if you want to see more details: https://github.com/
> benjamingr/RegExp.escape/blob/master/data/other_languages/discussions.md
>
> 2. Built-in function name
> The reporter proposed "regex_escape". I think the function name is
> intuitive and self-explanatory. Please suggest a better name if you have one.
>
> 3. Signature of the built-in function
> Do we need to extend the function signature? I guess a user may want to
> pass a customized set of characters.
>
> regex_escape(string_pattern, [delimiter])
>
> delimiter
>   := "^[A-Za-z0-9]"
>   | "[.\?\[^()\]{}=!<>|:-]"
>
> "^[A-Za-z0-9]" means "escapes non-alphanumeric characters"
> "[.\?\[^()\]{}=!<>|:-]" means "escapes the specified characters"
> Within delimiter, the characters [ and ] must themselves be escaped.
>
> Best regards,
> Jinchul
>
>
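
For reference, the two-argument signature proposed in the quoted message
could be sketched in Python roughly as follows. This is only an illustration
under assumptions: the parameter name char_class and its default are made up
here, and the "non-alphanumeric" case is written as the standard negated
class [^A-Za-z0-9] rather than in the notation used in the proposal.

```python
import re


def regex_escape(pattern: str, char_class: str = r"[^A-Za-z0-9]") -> str:
    """Backslash-prefix every character of pattern matched by char_class.

    char_class is a regular-expression character class that selects which
    characters to escape (hypothetical parameter, for illustration only).
    """
    # A function replacement avoids any special handling of backslashes
    # that a replacement string would be subject to.
    return re.sub(char_class, lambda m: "\\" + m.group(0), pattern)


# Perl-like behaviour: escape everything that is not alphanumeric.
print(regex_escape("a_b.c"))  # a\_b\.c

# Explicit set, using the second alternative from the proposal.
print(regex_escape("a_b.c", r"[.\?\[^()\]{}=!<>|:-]"))  # a_b\.c
```

With an explicit set, the caller controls exactly which characters are
treated as special, at the cost of having to escape [ and ] inside the class.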
