hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Teddy Choi (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-4642) Implement vectorized RLIKE and REGEXP filter expressions
Date Thu, 13 Jun 2013 14:48:20 GMT

    [ https://issues.apache.org/jira/browse/HIVE-4642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13682299#comment-13682299
] 

Teddy Choi commented on HIVE-4642:
----------------------------------

[~ehans], I read some books and papers about regular expressions. According to http://en.wikipedia.org/wiki/Regular_expression#Implementations_and_running_times
, DFA construction time and backtracking NFA running time are exponential, so they are not
good solutions.

According to http://swtch.com/~rsc/regexp/regexp1.html and http://www.cs.ucdavis.edu/~green/papers/techrept02.pdf
, lazy DFA could be a good choice. It's construction time is O(n ), and its running time is
O(m^2*n) at most. If there are enough target strings and they share a similar pattern, then
the average running time will become O(n ). So it looks promising.

It's hard to find a Java lazy DFA regular expression engine. java.util.regex is traditional
NFA, and JRegex is DFA. Jarkta Regex is retired, so is Jarkarta ORO. And others are not updated
for years. I'm surprised that it's hard to find one. So I think it will be good to implement
one by myself. Fortunately, it doesn't seem hard to implement.

If you know an alternative solution, please let me know it.
                
> Implement vectorized RLIKE and REGEXP filter expressions
> --------------------------------------------------------
>
>                 Key: HIVE-4642
>                 URL: https://issues.apache.org/jira/browse/HIVE-4642
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Eric Hanson
>            Assignee: Teddy Choi
>
> See title. I will add more details next week. The goal is (a) make this work correctly
and (b) optimize it as well as possible, at least for the common cases.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Mime
View raw message