lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Building FST-like automaton queries
Date Tue, 28 Feb 2012 13:31:54 GMT
Neat :)  It's like a FuzzyQuery w/ a custom (binary?) cost matrix for
the insert/delete/transposition changes...

Is the number of edits smallish?  Ie you're not concerned about
combinatoric explosion of step 1?

For steps 2 and 3 you shouldn't use FST at all.  Instead, for 2) use
BasicAutomata.makeString(String) on each of your expanded terms, then
BasicOperations.union on all of those automata to make a single
automaton accepting all your expanded terms, then likely call
.determinize() on the resulting automaton (maybe also .minimize() but
I think that may not help).  Then pass that automaton to AQ.

We don't yet have a way to drive a query from an FST, but that would
be an interesting addition.  EG you could then support weights as
well, to decide how the terms are scored (if certain OCR errors are
more likely than others).

Mike McCandless

http://blog.mikemccandless.com

On Tue, Feb 28, 2012 at 7:33 AM, Alan Woodward
<alan.woodward@romseysoftware.co.uk> wrote:
> Hello,
>
> I'm trying to create a Lucene Query that will take a term and expand it to include common
OCR errors (for example, 'cl' is often misread as 'd', so a search for 'clog' should also
hit 'dog').  My plan is to do this by generating all the possible variants of a term, using
an existing list of errors, and then somehow mapping this into an AutomatonQuery.  I've been
looking around the o.a.l.util.automaton and o.a.l.util.fst packages on trunk, and I *think*
that this is possible, but I'm so far failing to work out how to put the various bits together.
>
> I'm thinking it should work like this:
> 1) expand query term to sorted list of possible matches
> 2) create an FST over those matches
> 3) plug this FST into an AutomatonQuery subclass.
>
> 1) is easy.  It's 2) and 3) I'm having trouble with.
>
> All help gratefully received!
>
> Thanks,
>
> Alan Woodward
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message