lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2265) improve automaton performance by running on byte[]
Date Fri, 23 Apr 2010 15:58:49 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860295#action_12860295 ]

Robert Muir commented on LUCENE-2265:
-------------------------------------

So here are the advantages of the current patch:
* Full Unicode support (regular expression, wildcard, fuzzy). For example, the wildcard ? matches
a code point, not a code unit.
* Easy support for matching all Unicode encoding forms (UTF-8, UTF-16, UTF-32).
* Easy to support both the native UTF-8 terms sort order and the UTF-8-in-UTF-16 order we have now.
This is not feasible with the existing UTF-16 representation.
* Easy to safely do DFA operations on Automaton, because there are no surrogates anymore.
For example, we can safely reverse any automaton to take advantage of Solr's leading wildcard
support (e.g. to support "leading" regexps, too).
* Better compatibility with Lucene, because the automaton is in sync with the terms format (byte).
This could lead to future optimizations such as TermsEnum exposing the 'shared prefix' of a term
with the previously enumerated term.
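
To make the code point vs. code unit distinction and the two sort orders concrete, here is a small standalone sketch. It uses only the JDK, not Lucene; the class name CodePointDemo and the helper compareUtf8 are made up for illustration and are not part of the patch.

```java
import java.nio.charset.StandardCharsets;

// Illustration only: hypothetical names, plain JDK, no Lucene classes.
final class CodePointDemo {

    // Compares two byte arrays as unsigned bytes, lexicographically.
    // For well-formed UTF-8 this is the same as code point order.
    static int compareUtf8(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        // U+10000 is the first supplementary code point; U+FF5E is a high BMP code point.
        String supp = new String(Character.toChars(0x10000));
        String bmp = "\uFF5E";

        System.out.println(supp.length());                                // 2 UTF-16 code units
        System.out.println(supp.codePointCount(0, supp.length()));        // 1 code point
        System.out.println(supp.getBytes(StandardCharsets.UTF_8).length); // 4 UTF-8 bytes

        // UTF-16 code unit order puts the surrogate pair before U+FF5E ...
        System.out.println(bmp.compareTo(supp) > 0);
        // ... while UTF-8 byte order (code point order) puts U+FF5E first.
        System.out.println(compareUtf8(bmp.getBytes(StandardCharsets.UTF_8),
                                       supp.getBytes(StandardCharsets.UTF_8)) < 0);
    }
}
```

Both comparisons print true: the same two terms sort in opposite directions depending on whether they are compared as UTF-16 code units or as UTF-8 bytes, so a term dictionary and an automaton have to agree on one order.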

Unfortunately, there are currently a few disadvantages with the patch, but I think we can
resolve these:
* The linear fuzzy terms enum, from the old code, needs to be fixed so that it is consistent
and uses UTF-32 calculations, too.
* For huge DFAs (e.g. fuzzy) there is some cost to the conversion: around 5 ms, as a one-time
cost, on my machine for very long strings. Perhaps we can optimize some code here, but it's not
blowing up.

So in my opinion, the first item should be resolved before committing, and the second is
nice-to-have and shouldn't block the improvement.
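
As a rough illustration of why the fuzzy calculation needs to work on UTF-32 (code points), here is a plain Levenshtein distance run once over UTF-16 code units and once over code points. This is a generic textbook implementation with hypothetical names (EditDistanceDemo, editDistance), not the linear fuzzy terms enum itself.

```java
// Illustration only: generic Levenshtein distance, hypothetical names, plain JDK.
final class EditDistanceDemo {

    // Classic two-row dynamic-programming edit distance over int sequences.
    static int editDistance(int[] a, int[] b) {
        int[] prev = new int[b.length + 1];
        int[] cur = new int[b.length + 1];
        for (int j = 0; j <= b.length; j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length; i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length; j++) {
                int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = cur;
            cur = tmp;
        }
        return prev[b.length];
    }

    // Distance counted in UTF-16 code units (char values).
    static int byCodeUnits(String s, String t) {
        return editDistance(s.chars().toArray(), t.chars().toArray());
    }

    // Distance counted in Unicode code points (UTF-32 values).
    static int byCodePoints(String s, String t) {
        return editDistance(s.codePoints().toArray(), t.codePoints().toArray());
    }

    public static void main(String[] args) {
        String a = "a";
        String emoji = new String(Character.toChars(0x1F600)); // one code point, two code units
        System.out.println(byCodeUnits(a, emoji));  // 2
        System.out.println(byCodePoints(a, emoji)); // 1
    }
}
```

Replacing one character with a supplementary character is a single edit in code points but two edits in code units, so an enum that counts chars would disagree with an automaton built on code points.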


> improve automaton performance by running on byte[]
> --------------------------------------------------
>
>                 Key: LUCENE-2265
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2265
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: Flex Branch
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-2265.patch, LUCENE-2265.patch, LUCENE-2265.patch, LUCENE-2265.patch,
>                      LUCENE-2265.patch, LUCENE-2265.patch, LUCENE-2265.patch, LUCENE-2265.patch,
>                      LUCENE-2265_pare.patch, LUCENE-2265_utf32.patch
>
>
> Currently, when enumerating terms, automaton must convert entire terms from flex's native
> UTF-8 byte[] to char[] first, then step each char through the state machine.
> We can make this more efficient by allowing the state machine to run on byte[], so it
> can return true/false faster.
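
The idea in the description can be sketched as a tiny hand-written DFA that steps directly over a term's UTF-8 bytes, with no byte[] to char[] conversion. This is a toy for the wildcard pattern a? only, under simplifying assumptions (it does not reject overlong or surrogate encodings); the class and method names are hypothetical and this is not the code in the patch.

```java
import java.nio.charset.StandardCharsets;

// Illustration only: a hard-coded byte-level DFA for the wildcard "a?",
// where ? matches exactly one code point. Hypothetical names, simplified UTF-8 checks.
final class ByteDfaDemo {
    private static final int REJECT = -1;
    private static final int START = 0;   // expect the byte for 'a'
    private static final int LEAD = 1;    // expect the lead byte of the ? code point
    private static final int CONT1 = 2;   // one continuation byte still needed
    private static final int CONT2 = 3;   // two continuation bytes still needed
    private static final int CONT3 = 4;   // three continuation bytes still needed
    private static final int ACCEPT = 5;  // whole pattern consumed

    private static boolean isContinuation(int b) {
        return b >= 0x80 && b <= 0xBF;
    }

    // One transition of the DFA on a single unsigned byte value.
    static int step(int state, int b) {
        switch (state) {
            case START:
                return b == 0x61 ? LEAD : REJECT;
            case LEAD:
                if (b <= 0x7F) return ACCEPT;              // 1-byte sequence
                if (b >= 0xC2 && b <= 0xDF) return CONT1;  // 2-byte sequence
                if (b >= 0xE0 && b <= 0xEF) return CONT2;  // 3-byte sequence
                if (b >= 0xF0 && b <= 0xF4) return CONT3;  // 4-byte sequence
                return REJECT;
            case CONT1:
                return isContinuation(b) ? ACCEPT : REJECT;
            case CONT2:
                return isContinuation(b) ? CONT1 : REJECT;
            case CONT3:
                return isContinuation(b) ? CONT2 : REJECT;
            default:
                return REJECT; // any byte after ACCEPT, or already rejected
        }
    }

    // Runs the DFA over the raw UTF-8 bytes of a term and stops at the first dead state.
    static boolean matches(byte[] utf8Term) {
        int state = START;
        for (byte raw : utf8Term) {
            state = step(state, raw & 0xFF);
            if (state == REJECT) {
                return false;
            }
        }
        return state == ACCEPT;
    }

    public static void main(String[] args) {
        String emoji = new String(Character.toChars(0x1F600));
        System.out.println(matches("ab".getBytes(StandardCharsets.UTF_8)));          // true
        System.out.println(matches(("a" + emoji).getBytes(StandardCharsets.UTF_8))); // true
        System.out.println(matches("abc".getBytes(StandardCharsets.UTF_8)));         // false
    }
}
```

Because the matcher bails out at the first dead state, it can answer false after reading only a prefix of the term's bytes.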

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



