From: smokey
To: java-user@lucene.apache.org
Subject: Re: Applying SpellChecker to a phrase
Date: Mon, 3 Dec 2007 10:44:14 -0500
In-Reply-To: <359a92830712030512s46a750c1l5ef460b62efd8353@mail.gmail.com>

I have not tried this yet. I am trying to learn the best practices from
others who have experience with SpellChecker before actually
implementing it.

If I understand it correctly, the SpellChecker class suggests alternate
but similar words for a single input term. So I believe I will have to
parse the phrase string and apply the spell checker to each member term
to construct the final expanded query. I don't think there is any
higher-level support that lets me apply spell checking to a phrase and
call query.toString() to examine how the query was expanded internally
(although it would have been nice to have something like that; has
anyone written or found such a class?).

As for performance, we're dealing with hundreds of indexes, each of
which typically grows well above 1G in size, so performance is the
single most important factor to consider.

On Dec 3, 2007 8:12 AM, Erick Erickson wrote:

> Have you actually tried this and done a query.toString() to see
> how this is actually expanded? Not that I'm all that familiar
> with SpellChecker, but before presuming how things work
> you would get answers faster if you ran a test.....
>
> And, why do you care about performance? I know that's
> a silly question, but you haven't supplied any parameters
> about your index and usage to give us a clue whether this
> matters. If your index is 3M, you'll never see the difference
> between the two ways of expanding the query. If your
> index is distributed over 10 machines and is 1T, you really,
> really, really care.
>
> And under any circumstances, you can always generate
> your own query of the second form by a bit of pre-processing.
>
> More info please.....
>
> Best
> Erick
>
> On Dec 2, 2007 10:14 PM, smokey wrote:
>
> > Suppose I have an index containing the terms impostor, imposter,
> > fraud, and fruad. Then presumably, regardless of whether I spell
> > impostor and fraud correctly, Lucene SpellChecker will offer the
> > improperly spelled versions as corrections. This means that the
> > phrase "The login fraud involves an impostor" would need to expand
> > to:
> >
> > "The login fraud involves an impostor" OR "The login fruad involves
> > an impostor" OR "The login fraud involves an imposter" OR "The login
> > fruad involves an imposter" to cover all cases and thus find all
> > possible matches.
> >
> > However, that feels like an awful lot of matches to perform on the
> > index. A more efficient approach would be to expand the query to
> > "The login (fraud OR fruad) involves an (impostor OR imposter)",
> > which should be logically equivalent to the first (longer) query.
> >
> > So my questions are:
> > (1) whether others have generated "The login (fraud OR fruad)
> > involves an (impostor OR imposter)" types of queries when applying
> > SpellChecker to a phrase, and agree that this indeed performs
> > better than the first form;
> > (2) whether others have observed any problems in doing so, in terms
> > of performance or anything else.
> >
> > Any information would be appreciated.
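
For illustration, the per-term expansion discussed in this thread can be
sketched in plain Java. This is only a sketch under stated assumptions:
the suggestions map stands in for whatever SpellChecker would return for
each term, and PhraseExpander, expand, and countVariants are
hypothetical names, not part of Lucene. It builds the grouped form and
counts how many full phrases the longer OR-of-phrases form would need.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class PhraseExpander {

    // Build the grouped form: a term with suggestions becomes
    // "(term OR alt1 OR ...)"; a term without suggestions is kept as-is.
    // The suggestions map is a hypothetical stand-in for spell checker output.
    static String expand(String phrase, Map<String, List<String>> suggestions) {
        List<String> parts = new ArrayList<>();
        for (String term : phrase.trim().split("\\s+")) {
            List<String> alts = suggestions.getOrDefault(term.toLowerCase(), List.of());
            if (alts.isEmpty()) {
                parts.add(term);
            } else {
                parts.add("(" + term + " OR " + String.join(" OR ", alts) + ")");
            }
        }
        return String.join(" ", parts);
    }

    // Number of complete phrases the fully enumerated form would contain:
    // the product over all terms of (1 + number of alternatives).
    static long countVariants(String phrase, Map<String, List<String>> suggestions) {
        long count = 1;
        for (String term : phrase.trim().split("\\s+")) {
            count *= 1 + suggestions.getOrDefault(term.toLowerCase(), List.of()).size();
        }
        return count;
    }

    public static void main(String[] args) {
        Map<String, List<String>> suggestions = Map.of(
                "fraud", List.of("fruad"),
                "impostor", List.of("imposter"));
        String phrase = "The login fraud involves an impostor";

        // prints: The login (fraud OR fruad) involves an (impostor OR imposter)
        System.out.println(expand(phrase, suggestions));
        // prints: 4
        System.out.println(countVariants(phrase, suggestions));
    }
}
```

The grouped string is meant for inspection only; it is not claimed to be
valid query-parser syntax. In Lucene, alternatives at a single phrase
position would more likely be expressed with MultiPhraseQuery, which has
not been tested here.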