Subject: RE: question with spellchecker
Date: Mon, 12 Jun 2006 18:09:20 -0400
From: "Van Nguyen"

I'll experiment with both. Thanks...

-----Original Message-----
From: mark harwood [mailto:markharw00d@yahoo.co.uk]
Sent: Wednesday, June 07, 2006 2:16 AM
To: java-user@lucene.apache.org
Subject: Re: question with spellchecker

I think the problem in your particular example is that the suggestion software takes no account of context. I've been playing recently with context-sensitive suggestions, which take a set of validated (i.e. existing) words (e.g. "tape") and use them to help shortlist alternatives for an unknown or partially typed word (e.g. "ducted"). This has potential applications in spell checking and as-you-type query completion.

The approach is quite simple but effective. You use your choice of code to produce a list of candidate terms (e.g. FuzzyTermEnum, some form of Soundex, or PrefixQuery), THEN take the large list of candidate terms produced and compare their usage against the context of words you already know, e.g. "tape".

In practice this means that TermDocs for the candidate term are used to construct a doc bitset, which is compared with a doc bitset produced from all the other terms that make up the context. The level of intersection between these bitsets can be used to help sensibly rank the "duct" and "ducked" candidates in relation to "tape". Do they co-occur often?
[pseudo code]

    BitSet contextDocs = matchKnownTerms();
    float numContextMatches = contextDocs.cardinality();
    for all candidate terms for unknown term
    {
        BitSet candMatches = createBitset(candTerm)
        float numCandMatches = candMatches.cardinality();
        float numSharedMatches = candMatches.and(contextDocs).cardinality()
        float contextRelatedness = numSharedMatches /
            ( (numCandMatches + numContextMatches) - numSharedMatches )
        //collect candidate Terms that have high combo of
        //contextRelatedness and unknown term similarity (eg low edit distance)
    }

There are quite a few optimisations I've added to this basic pseudo code in my implementation. When I get some time I'll package this code up and contribute it, but for now this pseudo code may give some pointers towards a solution.

Cheers,
Mark

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
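
For anyone wanting to try the idea without waiting for the packaged code, the relatedness score in the pseudo code is the Jaccard overlap of two doc-id sets, and it might be sketched in plain Java roughly as below. This is not Mark's implementation and it does not touch the Lucene API: it assumes the doc bitsets for the context and for each candidate have already been built (for example from TermDocs), and the class and method names (ContextRelatedness, jaccard, rank, docs) are made up for illustration. Note that java.util.BitSet.and() returns void and modifies its receiver, so the sketch intersects on a copy.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/** Sketch only: ranks candidate terms by how often they co-occur with the context terms. */
final class ContextRelatedness {

    /** Jaccard overlap |A and B| / |A or B| of two doc-id bitsets; 0 when both are empty. */
    static double jaccard(BitSet candidateDocs, BitSet contextDocs) {
        // BitSet.and() mutates its receiver, so intersect on a copy.
        BitSet shared = (BitSet) candidateDocs.clone();
        shared.and(contextDocs);
        int numShared = shared.cardinality();
        int numUnion = candidateDocs.cardinality() + contextDocs.cardinality() - numShared;
        return numUnion == 0 ? 0.0 : (double) numShared / numUnion;
    }

    /** Returns the candidate terms ordered by descending relatedness to the context. */
    static List<String> rank(Map<String, BitSet> candidateDocs, BitSet contextDocs) {
        List<String> terms = new ArrayList<>(candidateDocs.keySet());
        terms.sort(Comparator.comparingDouble(
                (String t) -> jaccard(candidateDocs.get(t), contextDocs)).reversed());
        return terms;
    }

    /** Hypothetical helper for experiments: builds a bitset from a list of doc ids. */
    static BitSet docs(int... ids) {
        BitSet bits = new BitSet();
        for (int id : ids) {
            bits.set(id);
        }
        return bits;
    }
}
```

With context docs {1, 2, 3, 4} for "tape", a candidate "duct" found in docs {2, 3, 4, 5} shares 3 docs out of a union of 5, while "ducked" in docs {4, 9} shares 1 out of 5, so rank() would put "duct" first. In a real spell checker this score would still need to be combined with the edit-distance similarity mentioned in the pseudo code comments.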