lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2265) improve automaton performance by running on byte[]
Date Fri, 02 Apr 2010 23:47:27 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853007#action_12853007 ]

Robert Muir commented on LUCENE-2265:
-------------------------------------

{quote}
So? You aren't making some generic automaton library, are you?
Get these user regexes/wildcards/etc, convert them to utf-8, build utf-8 automaton, run it
against lucene data. 
{quote}

This just pushes the complexity into the parsers. And yes, it makes sense to support
high-level (char[]) operations with automaton too, such as analysis.

I encourage you to take a look at the existing code. In general, a lot of parsers (see
wildcard and regex) are implemented with primitive automata like 'makeAnyChar'.
'makeAnyByte' makes no sense.
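
As a rough illustration of why a byte-level primitive does not map onto what parsers need, here is a minimal sketch. The class and method names (AnyCharSketch, anyChar, anyByte) are hypothetical and are not the automaton package's API. It compares "exactly one Unicode code point" with "exactly one byte" for a wildcard-style single-character match:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; none of these names come from Lucene.
final class AnyCharSketch {

    // "Any char" in code point space: the input is exactly one code point.
    static boolean anyChar(String s) {
        return s.codePointCount(0, s.length()) == 1;
    }

    // A naive "any byte" primitive: the UTF-8 input is exactly one byte.
    static boolean anyByte(byte[] utf8) {
        return utf8.length == 1;
    }

    public static void main(String[] args) {
        String[] samples = { "a", "\u00e9", "\uD83D\uDE00" }; // U+0061, U+00E9, U+1F600
        for (String s : samples) {
            byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
            System.out.println(utf8.length + " byte(s): anyChar=" + anyChar(s)
                    + " anyByte=" + anyByte(utf8));
        }
    }
}
```

Only the ASCII sample is a single byte in UTF-8, so a parser built on a one-byte primitive would have to reason about multi-byte sequences itself, which is the complexity being pushed into the parsers.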

So it's generic in the sense that fuzzy, regex, and wildcard (all of our users) are defined
on Unicode characters. High-level operations such as parsing, intersection, and union belong
in UTF-16 or UTF-32 space, not with bytes.

Bytes are an implementation detail, and we shouldn't operate on UTF-8 except behind the scenes.

> improve automaton performance by running on byte[]
> --------------------------------------------------
>
>                 Key: LUCENE-2265
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2265
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: Flex Branch
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: Flex Branch
>
>         Attachments: LUCENE-2265.patch
>
>
> Currently, when enumerating terms, automaton must first convert entire terms from flex's
> native UTF-8 byte[] to char[], then step each char through the state machine.
> We can make this more efficient by allowing the state machine to run on byte[], so it
> can return true/false faster.
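
The two approaches in the description can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the attached patch and not Lucene's automaton API: ByteDfaSketch, forPrefix, run, and runViaChars are hypothetical names, and the machine only handles a simple prefix match. One path decodes the term to chars before matching; the other steps a table-driven state machine directly over the UTF-8 bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch of a byte-level state machine; not Lucene code.
final class ByteDfaSketch {
    private static final int DEAD = -1;

    // transitions[state][unsigned byte value] gives the next state, or DEAD.
    private final int[][] transitions;
    private final boolean[] accept;

    private ByteDfaSketch(int[][] transitions, boolean[] accept) {
        this.transitions = transitions;
        this.accept = accept;
    }

    // Builds a machine accepting any byte sequence that starts with the
    // UTF-8 encoding of the given prefix.
    static ByteDfaSketch forPrefix(String prefix) {
        byte[] p = prefix.getBytes(StandardCharsets.UTF_8);
        int n = p.length;
        int[][] t = new int[n + 1][256];
        for (int[] row : t) {
            Arrays.fill(row, DEAD);
        }
        for (int i = 0; i < n; i++) {
            t[i][p[i] & 0xFF] = i + 1; // advance on the expected byte
        }
        Arrays.fill(t[n], n); // the final state loops on every byte
        boolean[] acc = new boolean[n + 1];
        acc[n] = true;
        return new ByteDfaSketch(t, acc);
    }

    // Byte path: step the machine over the term's bytes, no decoding.
    boolean run(byte[] term, int offset, int length) {
        int state = 0;
        for (int i = offset; i < offset + length; i++) {
            state = transitions[state][term[i] & 0xFF];
            if (state == DEAD) {
                return false; // early exit on the first mismatching byte
            }
        }
        return accept[state];
    }

    // Char path: decode the whole term first, then match on chars.
    static boolean runViaChars(String prefix, byte[] term, int offset, int length) {
        String decoded = new String(term, offset, length, StandardCharsets.UTF_8);
        return decoded.startsWith(prefix);
    }

    public static void main(String[] args) {
        String prefix = "caf\u00e9";
        ByteDfaSketch dfa = forPrefix(prefix);
        byte[] term = "caf\u00e9 au lait".getBytes(StandardCharsets.UTF_8);
        System.out.println(dfa.run(term, 0, term.length));
        System.out.println(runViaChars(prefix, term, 0, term.length));
    }
}
```

In this sketch the byte path can reject a term on its first mismatching byte without allocating or decoding anything, while the char path always decodes the entire term before it can answer.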
