lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From mark harwood <>
Subject Re: question with spellchecker
Date Wed, 07 Jun 2006 09:16:10 GMT
I think the problem in your particular example is the
suggestion software has no consideration of context.

I've been playing with context-sensitive suggestions
recently which take a bunch of validated (ie existing)
words (eg "tape") and use this to help shortlist
alternatives for an unknown or partially typed word
(eg ducted)

This has potential applications in spell checking and
as-you-type query completion.

The approach is quite simple but effective - You use
your choice of code to produce a list of candidate
terms  (eg FuzzyTermEnum or some form of Soundex or
PrefixQuery) THEN take the large list of candidate
terms produced and compare their usage in relation to
the context of  words you already know eg "tape".
In practice this means that TermDocs for the candidate
term are used to construct a doc bitset which is
compared with a doc bitset produced from all other
terms which make up the context. The level of
intersection between these bitsets can be used to help
sensibly rank the "duct" and "ducked" candidates in
relation to "tape". Do they co-occur often?

[psuedo code]
BitSet contextDocs= matchKnownTerms();
float numContextMatches=contextDocs.cardinality();
for all candidate terms for unknown term
  BitSet candMatches =createBitset(candTerm)
  float numCandMatches=candMatches.cardinality();
  float contextRelatedness =numSharedMatches/
  //collect candidate Terms that have high combo of 
  //contextRelatedness and unknown term similarity (eg
low edit distance) 

There are quite a few optimisations I've added to this
basic pseudo code in my implementation. When I get
some time I'll package this code up and contribute it
but for now this psuedo code may give some pointers
which help to provide a solution.


Send instant messages to your online friends 

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message