lucene-dev mailing list archives

From Mark Miller <markrmil...@gmail.com>
Subject Re: [jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex)
Date Sat, 05 Dec 2009 22:05:52 GMT
Could you be more specific :)

This patch is part of an issue to add an AutomatonQuery class to Lucene
that allows for a fast RegexpQuery and replaces our WildcardQuery impl.
It's being developed in two flavors: one for the current trunk version
of Lucene, and a slightly altered version for our "flexible indexing"
branch, where another large issue is being developed. Eventually that
branch will be merged back into trunk.

This might not be an issue where you want to get your feet wet ;) But if
you could be more explicit about what you want to know, we might be able
to be of more help. That's a pretty broad question. To take a stab
anyway: the short of it is to find an issue you find compelling and jump
in! :)

Ghazal Gharooni wrote:
> Hello,
>
> I am new to the community and I am completely confused. Could
> anybody help me understand which part of the code you are working on?
> How should I participate in the work? Thank you!
>
>
>
>
> On Sat, Dec 5, 2009 at 1:02 PM, Uwe Schindler (JIRA) <jira@apache.org> wrote:
>
>
>     [ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
>     Uwe Schindler updated LUCENE-1606:
>     ----------------------------------
>
>        Attachment:     (was: LUCENE-1606-flex.patch)
>
>     > Automaton Query/Filter (scalable regex)
>     > ---------------------------------------
>     >
>     >                 Key: LUCENE-1606
>     >                 URL: https://issues.apache.org/jira/browse/LUCENE-1606
>     >             Project: Lucene - Java
>     >          Issue Type: New Feature
>     >          Components: Search
>     >            Reporter: Robert Muir
>     >            Assignee: Robert Muir
>     >            Priority: Minor
>     >             Fix For: 3.1
>     >
>     >         Attachments: automaton.patch, automatonMultiQuery.patch,
>     automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch,
>     automatonWithWildCard.patch, automatonWithWildCard2.patch,
>     BenchWildcard.java, LUCENE-1606-flex.patch,
>     LUCENE-1606-flex.patch, LUCENE-1606-flex.patch,
>     LUCENE-1606-flex.patch, LUCENE-1606-flex.patch,
>     LUCENE-1606-flex.patch, LUCENE-1606.patch, LUCENE-1606.patch,
>     LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch,
>     LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch,
>     LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch,
>     LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606_nodep.patch
>     >
>     >
>     > Attached is a patch for an AutomatonQuery/Filter (name can
>     > change if it's not suitable).
>     > Whereas the out-of-box contrib RegexQuery is nice, I have some
>     > very large indexes (100M+ unique tokens) where queries are quite
>     > slow (2 minutes, etc.). Additionally, all of the existing
>     > RegexQuery implementations in Lucene are really slow if there is
>     > no constant prefix. This implementation does not depend upon a
>     > constant prefix, and runs the same query in 640ms.
>     > Some use cases I envision:
>     >  1. lexicography/etc on large text corpora
>     >  2. looking for things such as urls where the prefix is not
>     >     constant (http:// or ftp://)
>     > The Filter uses the BRICS package
>     > (http://www.brics.dk/automaton/) to convert regular expressions
>     > into a DFA. Then, the filter "enumerates" terms in a special way,
>     > by using the underlying state machine. Here is my short
>     > description from the comments:
>     >      The algorithm here is pretty basic. Enumerate terms but
>     >      instead of a binary accept/reject do:
>     >
>     >      1. Look at the portion that is OK (did not enter a reject
>     >         state in the DFA)
>     >      2. Generate the next possible String and seek to that.
>     > The Query simply wraps the filter with ConstantScoreQuery.
>     > I did not include the automaton.jar inside the patch but it can
>     > be downloaded from http://www.brics.dk/automaton/ and is
>     > BSD-licensed.
>
>     --
>     This message is automatically generated by JIRA.
>     -
>     You can reply to this email to add a comment to the issue online.
>
>
>     ---------------------------------------------------------------------
>     To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>     For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>
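For anyone trying to picture the enumeration described in the quoted issue
(look at the portion of the term the DFA was still OK with, then seek past
everything sharing the rejected prefix), here is a rough, self-contained
sketch. It is not the code from the patch: the class names are made up, the
DFA is a tiny hand-built stand-in for (http|ftp)://.* rather than a BRICS
automaton, a sorted TreeSet stands in for the term dictionary, and the
"next possible String" is simply the rejected prefix with its last
character bumped by one, which is a cruder skip than consulting the DFA's
outgoing transitions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

class SeekingEnumSketch {
    static final int DEAD = -1;

    /**
     * Hypothetical stand-in for a compiled automaton: accepts any string that
     * starts with one of the given prefixes. Assumes no prefix is itself a
     * prefix of another one.
     */
    static final class PrefixDfa {
        private final List<Map<Character, Integer>> trans = new ArrayList<>();
        private static final int START = 0;
        private static final int ACCEPT = 1;

        PrefixDfa(String... prefixes) {
            trans.add(new HashMap<>()); // state 0: start
            trans.add(new HashMap<>()); // state 1: accept, loops on any char
            for (String p : prefixes) {
                int s = START;
                for (int i = 0; i < p.length(); i++) {
                    char c = p.charAt(i);
                    Integer next = trans.get(s).get(c);
                    if (next == null) {
                        if (i == p.length() - 1) {
                            next = ACCEPT;
                        } else {
                            trans.add(new HashMap<>());
                            next = trans.size() - 1;
                        }
                        trans.get(s).put(c, next);
                    }
                    s = next;
                }
            }
        }

        int step(int state, char c) {
            if (state == ACCEPT) return ACCEPT; // the ".*" tail
            Integer next = trans.get(state).get(c);
            return next == null ? DEAD : next;
        }

        boolean isAccept(int state) { return state == ACCEPT; }
    }

    /** Walks the sorted terms, seeking past whole rejected prefixes. */
    static List<String> matchingTerms(TreeSet<String> terms, PrefixDfa dfa) {
        List<String> hits = new ArrayList<>();
        String term = terms.isEmpty() ? null : terms.first();
        while (term != null) {
            int state = PrefixDfa.START;
            int rejectedAt = -1;
            for (int i = 0; i < term.length(); i++) {
                state = dfa.step(state, term.charAt(i));
                if (state == DEAD) { rejectedAt = i; break; }
            }
            if (rejectedAt < 0) {
                // Whole term consumed without dying: accept or not, move on.
                if (dfa.isAccept(state)) hits.add(term);
                term = terms.higher(term);
            } else if (term.charAt(rejectedAt) == Character.MAX_VALUE) {
                term = terms.higher(term); // cannot bump the char; no skip
            } else {
                // Every term starting with term[0..rejectedAt] is rejected too,
                // so seek to the smallest string past that prefix.
                char bumped = (char) (term.charAt(rejectedAt) + 1);
                term = terms.ceiling(term.substring(0, rejectedAt) + bumped);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TreeSet<String> terms = new TreeSet<>(Arrays.asList(
            "apple", "ftp://a.example", "ftpx", "gopher://x",
            "http://b.example", "httpd", "https://c.example", "zebra"));
        // prints [ftp://a.example, http://b.example]
        System.out.println(matchingTerms(terms, new PrefixDfa("http://", "ftp://")));
    }
}
```

The point of the seek is that a term like "apple" rules out every other
term starting with "a" in one step, so the cost depends on how often the
DFA rejects early rather than on the total number of terms. Note that the
sketch compares Java chars (UTF-16 order), which is only an approximation
of how an index orders its terms.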


-- 
- Mark

http://www.lucidimagination.com





