lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (LUCENE-1606) Automaton Query/Filter (scalable regex)
Date Mon, 23 Nov 2009 15:38:39 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781431#action_12781431
] 

Robert Muir edited comment on LUCENE-1606 at 11/23/09 3:38 PM:
---------------------------------------------------------------

Mike, just one comment here.

I am definitely willing to do refactoring here to support this byte[] scheme if necessary,
I don't want to give the wrong impression. I think i already have an issue here related to
UTF-16 binary order vs UTF-8 binary order that I need to fix, although I think this is just
writing a Comparator.

edit: pretty sure this exists. If someone has say, both data from Arabic Presentation forms
and Chinese text outside the BMP in the index, the "smart" enumerator will unknowningly skip
right past the Arabic Presentation forms block, because it sorts after the lead surrogate
in UTF-16 order, but before the entire codepoint in UTF-8/UTF-32 order. I have not experienced
this in practice, because i normalize my text so i don't have stuff in Arabic Presentation
forms block :) I can fix this, but i would like to see what the approach is for the flex branch,
as its sufficiently compex that I would rather not fix it twice.

I am just concerned about other similar applications outside of lucene, or some already in
lucene core itself!


      was (Author: rcmuir):
    Mike, just one comment here.

I am definitely willing to do refactoring here to support this byte[] scheme if necessary,
I don't want to give the wrong impression. I think i already have an issue here related to
UTF-16 binary order vs UTF-8 binary order that I need to fix, although I think this is just
writing a Comparator.

I am just concerned about other similar applications outside of lucene, or some already in
lucene core itself!

  
> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
>                 Key: LUCENE-1606
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1606
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Search
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch,
automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch,
BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606.patch, LUCENE-1606.patch,
LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch,
LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+
unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing
RegexQuery implementations in Lucene are really slow if there is no constant prefix. This
implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular
expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the
underlying state machine. Here is my short description from the comments:
>      The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject
do:
>       
>      1. Look at the portion that is OK (did not enter a reject state in the DFA)
>      2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/
and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message