lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alan Woodward <alan.woodw...@romseysoftware.co.uk>
Subject Re: Building FST-like automaton queries
Date Tue, 28 Feb 2012 13:42:42 GMT

On 28 Feb 2012, at 13:31, Michael McCandless wrote:

> Neat :)  It's like a FuzzyQuery w/ a custom (binary?) cost matrix for
> the insert/delete/transposition changes...
> 
> Is the number of edits smallish?  Ie you're not concerned about
> combinatoric explosion of step 1?

We're only allowing expansions within an edit distance of 1, which should keep the numbers
of terms down. 

> 
> For steps 2 and 3 you shouldn't use FST at all.  Instead, for 2) use
> BasicAutomata.makeString(String) on each of your expanded terms, then
> BasicOperations.union on all of those automata to make a single
> automaton accepting all your expanded terms, then likely call
> .determinize() on the resulting automaton (maybe also .minimize() but
> I think that may not help).  Then pass that automaton to AQ.

Excellent, thanks for your help.  I'll give that a go.

> 
> We don't yet have a way to drive a query from an FST, but that would
> be an interesting addition.  EG you could then support weights as
> well, to decide how the terms are scored (if certain OCR errors are
> more likely than others).
> 
> Mike McCandless
> 
> http://blog.mikemccandless.com
> 
> On Tue, Feb 28, 2012 at 7:33 AM, Alan Woodward
> <alan.woodward@romseysoftware.co.uk> wrote:
>> Hello,
>> 
>> I'm trying to create a Lucene Query that will take a term and expand it to include
common OCR errors (for example, 'cl' is often misread as 'd', so a search for 'clog' should
also hit 'dog').  My plan is to do this by generating all the possible variants of a term,
using an existing list of errors, and then somehow mapping this into an AutomatonQuery.  I've
been looking around the o.a.l.util.automaton and o.a.l.util.fst packages on trunk, and I *think*
that this is possible, but I'm so far failing to work out how to put the various bits together.
>> 
>> I'm thinking it should work like this:
>> 1) expand query term to sorted list of possible matches
>> 2) create an FST over those matches
>> 3) plug this FST into an AutomatonQuery subclass.
>> 
>> 1) is easy.  It's 2) and 3) I'm having trouble with.
>> 
>> All help gratefully received!
>> 
>> Thanks,
>> 
>> Alan Woodward
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message