Subject: Re: Building FST-like automaton queries
From: Alan Woodward
Date: Tue, 28 Feb 2012 13:42:42 +0000
To: java-user@lucene.apache.org

On 28
Feb 2012, at 13:31, Michael McCandless wrote:

> Neat :) It's like a FuzzyQuery w/ a custom (binary?) cost matrix for
> the insert/delete/transposition changes...
>
> Is the number of edits smallish? Ie you're not concerned about
> combinatoric explosion of step 1?

We're only allowing expansions within an edit distance of 1, which should keep the numbers of terms down.

>
> For steps 2 and 3 you shouldn't use FST at all. Instead, for 2) use
> BasicAutomata.makeString(String) on each of your expanded terms, then
> BasicOperations.union on all of those automata to make a single
> automaton accepting all your expanded terms, then likely call
> .determinize() on the resulting automaton (maybe also .minimize() but
> I think that may not help). Then pass that automaton to AQ.

Excellent, thanks for your help. I'll give that a go.

>
> We don't yet have a way to drive a query from an FST, but that would
> be an interesting addition. EG you could then support weights as
> well, to decide how the terms are scored (if certain OCR errors are
> more likely than others).
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Tue, Feb 28, 2012 at 7:33 AM, Alan Woodward wrote:
>> Hello,
>>
>> I'm trying to create a Lucene Query that will take a term and expand it to include common OCR errors (for example, 'cl' is often misread as 'd', so a search for 'clog' should also hit 'dog'). My plan is to do this by generating all the possible variants of a term, using an existing list of errors, and then somehow mapping this into an AutomatonQuery. I've been looking around the o.a.l.util.automaton and o.a.l.util.fst packages on trunk, and I *think* that this is possible, but I'm so far failing to work out how to put the various bits together.
>>
>> I'm thinking it should work like this:
>> 1) expand query term to sorted list of possible matches
>> 2) create an FST over those matches
>> 3) plug this FST into an AutomatonQuery subclass.
>>
>> 1) is easy. It's 2) and 3) I'm having trouble with.
>>
>> All help gratefully received!
>>
>> Thanks,
>>
>> Alan Woodward
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
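
Step 1 of the plan quoted above (expanding a query term into a sorted list of OCR variants) might look roughly like the sketch below. It is an illustration only, not code from this thread: the class OcrVariants, the Confusion rule type, and the two sample rules in main are hypothetical, and "edit distance of 1" is assumed to mean that a variant applies a single confusion rule at a single position. No Lucene classes are used; the resulting sorted set is what would then be turned into one automaton per term and unioned, as Mike describes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

public class OcrVariants {

    /** Hypothetical rule type: the text "from" may have been misread as "to". */
    public static final class Confusion {
        final String from;
        final String to;

        public Confusion(String from, String to) {
            if (from.isEmpty()) {
                throw new IllegalArgumentException("'from' must not be empty");
            }
            this.from = from;
            this.to = to;
        }
    }

    /**
     * Returns the term itself plus every variant produced by applying exactly
     * one rule at exactly one position (assumed meaning of "edit distance 1").
     * A TreeSet keeps the result sorted and free of duplicates.
     */
    public static SortedSet<String> expand(String term, List<Confusion> rules) {
        SortedSet<String> variants = new TreeSet<>();
        variants.add(term);
        for (Confusion rule : rules) {
            // Visit every occurrence of rule.from, including overlapping ones.
            int at = term.indexOf(rule.from);
            while (at >= 0) {
                String variant = term.substring(0, at)
                        + rule.to
                        + term.substring(at + rule.from.length());
                variants.add(variant);
                at = term.indexOf(rule.from, at + 1);
            }
        }
        return variants;
    }

    public static void main(String[] args) {
        // Sample rules taken from the example in the thread: 'cl' <-> 'd'.
        List<Confusion> rules = new ArrayList<>();
        rules.add(new Confusion("cl", "d"));
        rules.add(new Confusion("d", "cl"));
        System.out.println(expand("clog", rules));
    }
}
```

With those two sample rules, expanding 'clog' gives the two terms 'clog' and 'dog'. Because each variant comes from a single rule application, the number of terms grows with the number of matching positions rather than combinatorially.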