lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Dawid Weiss (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-3830) MappingCharFilter could be improved by switching to an FST.
Date Wed, 02 May 2012 08:52:51 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-3830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13266442#comment-13266442
] 

Dawid Weiss commented on LUCENE-3830:
-------------------------------------

bq. Hmm I don't quite understand ... can you describe more?

I'm on vacation this week, so short: from the brief code inspection I concluded you're using
a search from each position to get the maximum match, is this right? If so, the pessimistic
time is quite bad for patterns like "aaa*b" and input strings "aaaa*c" (replace * with a large
number of repeated sequences). The way I would try to implement it is to do a state-tracking
much like in non-deterministic automata, where you have a stack of "active" states which you
follow in the automaton and you move them from state to state on each input symbol. If there
a given state fires as a match then you have some conditions to check -- are there any states
that may be potentially longer but haven't fired yet (if so, you need to delay this firing
state because it can be subsumed by others), otherwise when the no other active state blocks
that one you can do the replacement.

I haven't really thought about how twisted the logic above can become but I suspect it's not
going to be really that bad. The gain is that each input symbol advances your state (at most
linearly with the number of active states). It helps with degenerate cases like the one above.
I suppose what Mihov et al. do is to statically (on the fst) determine which states can lead
to this "deferred match queue" and simply eliminate them, but haven't really looked into it.

It's an improvement, doesn't need to be implemented right away. Sorry for being brief -- the
network is flaky here, I'm in the middle of nowhere.

                
> MappingCharFilter could be improved by switching to an FST.
> -----------------------------------------------------------
>
>                 Key: LUCENE-3830
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3830
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Dawid Weiss
>            Assignee: Michael McCandless
>            Priority: Minor
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.0
>
>         Attachments: LUCENE-3830.patch
>
>
> MappingCharFilter stores an overly complex tree-like structure for matching input patterns.
The input is a union of fixed strings mapped to a set of fixed strings; an fst matcher would
be ideal here and provide both memory and speed improvement I bet.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message