lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] Updated: (LUCENE-2265) improve automaton performance by running on byte[]
Date Tue, 06 Apr 2010 04:32:29 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-2265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2265:
--------------------------------

    Attachment: LUCENE-2265.patch

OK, I think I made some serious progress here, but I did find a bug in the UTF-32 -> UTF-8
DFA converter.
The problem is that it does not handle, at the least, the case where the initial state is an
accept state.
I created a test case for this (TestUTF32SpecialCase) and added the Python code back in, as
I figure you will probably fix it there first.

I deleted the surrogate-seeking tests. As with other such nuances, if we switch to byte[]
these won't behave the same, since these regexps are no longer defined.

What remains is to switch the slow fuzzy to use code point calculations (to be consistent
with the fast one).
By the way, it's really silly that we have to Unicode-convert just to get the length in chars
for that score calculation... ugh!
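
For illustration only: if the length the score needs is a code point count, it can in
principle be read straight off well-formed UTF-8 by counting the bytes that are not
continuation bytes (10xxxxxx). The sketch below is an assumption about what such a helper
could look like; Utf8Length and codePointCount are hypothetical names and are not part of
the attached patch.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Length {
    /**
     * Counts code points in a slice of well-formed UTF-8.
     * Every code point has exactly one lead byte, so it is enough to
     * skip continuation bytes, which all have the bit pattern 10xxxxxx.
     * Assumes the input is valid UTF-8; malformed input is not detected.
     */
    static int codePointCount(byte[] utf8, int offset, int length) {
        int count = 0;
        for (int i = offset; i < offset + length; i++) {
            if ((utf8[i] & 0xC0) != 0x80) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // "naive" with a diaeresis on the i: 5 code points, 6 UTF-8 bytes
        byte[] term = "na\u00EFve".getBytes(StandardCharsets.UTF_8);
        System.out.println(codePointCount(term, 0, term.length));
    }
}
```

Note that this counts code points, not Java chars: a supplementary character is one code
point here but two chars in a String.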


> improve automaton performance by running on byte[]
> --------------------------------------------------
>
>                 Key: LUCENE-2265
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2265
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: Flex Branch
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: Flex Branch
>
>         Attachments: LUCENE-2265.patch, LUCENE-2265.patch, LUCENE-2265.patch, LUCENE-2265.patch, LUCENE-2265.patch, LUCENE-2265_pare.patch, LUCENE-2265_utf32.patch
>
>
> Currently, when enumerating terms, automaton must convert entire terms from flex's native UTF-8 byte[] to char[] first, then step each char through the state machine.
> We can make this more efficient by allowing the state machine to run on byte[], so it can return true/false faster.
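
As a rough sketch of the idea in the issue description, a table-driven state machine can
step over the term's UTF-8 bytes directly, with no char[] conversion. Everything below is
hypothetical and hand-built for illustration (ByteDfa, addTransition, setAccept, run and
eAcuteStar are made-up names, not the classes in the patch). The example machine matches
zero or more copies of U+00E9, whose UTF-8 encoding is the two bytes C3 A9, so its initial
state is also an accept state.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ByteDfa {
    private final int[][] next;     // next[state][byte & 0xFF], -1 means no transition
    private final boolean[] accept; // accept[state] is true for accepting states

    ByteDfa(int numStates) {
        next = new int[numStates][256];
        for (int[] row : next) {
            Arrays.fill(row, -1);
        }
        accept = new boolean[numStates];
    }

    void addTransition(int from, int b, int to) {
        next[from][b & 0xFF] = to;
    }

    void setAccept(int state) {
        accept[state] = true;
    }

    /** Steps the machine over raw bytes, starting in state 0. */
    boolean run(byte[] bytes, int offset, int length) {
        int state = 0;
        for (int i = offset; i < offset + length; i++) {
            state = next[state][bytes[i] & 0xFF];
            if (state == -1) {
                return false; // dead end: reject without reading the rest
            }
        }
        return accept[state];
    }

    /** Hand-built machine for (C3 A9)*, i.e. U+00E9 repeated zero or more times. */
    static ByteDfa eAcuteStar() {
        ByteDfa dfa = new ByteDfa(2);
        dfa.addTransition(0, 0xC3, 1); // lead byte
        dfa.addTransition(1, 0xA9, 0); // continuation byte completes one code point
        dfa.setAccept(0);              // initial state accepts, so empty input matches
        return dfa;
    }

    public static void main(String[] args) {
        ByteDfa dfa = eAcuteStar();
        byte[] term = "\u00E9\u00E9".getBytes(StandardCharsets.UTF_8);
        System.out.println(dfa.run(term, 0, term.length));
    }
}
```

Because state 1 (in the middle of a code point) is not accepting, a term truncated after the
lead byte is rejected, while the empty term is accepted by the initial state.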

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

