lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
Date Tue, 24 Nov 2009 20:57:39 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782159#action_12782159
] 

Robert Muir commented on LUCENE-1606:
-------------------------------------

{quote}
I think I'm confused - if the query is ???1234, the common suffix is
1234, and so shouldn't the DFA tell us the next XXX1234 term to try to
seek to (and we should never use next() on the enum)?
{quote}

not really the suffix, but more general, the structure of the dfa itself tells us that for
???1234, if you are evaluating 5551234, that the next term should really be 5561234
you can tell this by basically 'walking the graph'

but this knowledge is not always available, sometimes we only have 'partial' knowledge of
where to go next.
The culprit here is the * operator, because it eats up anything :)
So when you walk the graph, the * operator is this giant monster in your way. 

so sometimes, depending on the dfa (which depends on the wildcard or regex used to construct
it), 
automaton knows enough to seek to all the exact locations, if its finite (like ???1234)

sometimes, it only knows partial information. when a loop is encountered walking the graph
(the giant monster), it has to stop and only use the path information it knows so far.
for example a wildcard of abcd*1234, the best it can do is seek to abcd.

your description of ping-pong is right, this is how these situations are handled.

right now, the way the enum works, is i don't even bother seeking until i hit a mismatch.
you can see this in the comments 'as long as there is a match, keep reading. this is an optimization
when many terms are matched sequentially like ab*'

i tested this along time ago, perhaps we should re-test to see if its appropriate?


> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
>                 Key: LUCENE-1606
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1606
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Search
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch,
automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch,
BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606.patch, LUCENE-1606.patch,
LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch,
LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+
unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing
RegexQuery implementations in Lucene are really slow if there is no constant prefix. This
implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular
expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the
underlying state machine. Here is my short description from the comments:
>      The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject
do:
>       
>      1. Look at the portion that is OK (did not enter a reject state in the DFA)
>      2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/
and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message