lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Michael McCandless (JIRA)" <>
Subject [jira] [Commented] (LUCENE-3830) MappingCharFilter could be improved by switching to an FST.
Date Thu, 03 May 2012 22:48:48 GMT


Michael McCandless commented on LUCENE-3830:

bq.  from the brief code inspection I concluded you're using a search from each position to
get the maximum match, is this right?

Right.  I mean, that's how the code currently works ... I'm just
replacing the recursive Map w/ an FST.

bq. If so, the pessimistic time is quite bad for patterns like "aaa*b" and input strings "aaaa*c"
(replace * with a large number of repeated sequences). 

Yes... though I suspect in practice the impact is minimal for Lucene's
typical use cases.  Still it would be nice to fix :)

bq. The way I would try to implement it is to do a state-tracking much like in non-deterministic
automata, where you have a stack of "active" states which you follow in the automaton and
you move them from state to state on each input symbol.

OK!  I think I understand... I think this amounts to putting a .*
cycle on the FST's start node, and then doing subset construction to
"determinize" as you traverse?  Ie track all states that the input
characters could be in... so that you only traverse the input once.

bq. I suppose what Mihov et al. do is to statically (on the fst) determine which states can
lead to this "deferred match queue" and simply eliminate them, but haven't really looked into

Yeah I think I think the paper (and Aho/Corasick) essentially do that
determinization "up front".

SynonymFilter also has the same limitation....

> MappingCharFilter could be improved by switching to an FST.
> -----------------------------------------------------------
>                 Key: LUCENE-3830
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Dawid Weiss
>            Assignee: Michael McCandless
>            Priority: Minor
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.0
>         Attachments: LUCENE-3830.patch
> MappingCharFilter stores an overly complex tree-like structure for matching input patterns.
The input is a union of fixed strings mapped to a set of fixed strings; an fst matcher would
be ideal here and provide both memory and speed improvement I bet.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:!default.jspa
For more information on JIRA, see:


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message