lucene-java-user mailing list archives

From David Spencer <dave-lucene-u...@tropo.com>
Subject Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene
Date Tue, 14 Sep 2004 19:28:48 GMT
Tate Avery wrote:

> I get a NullPointerException shown (via Apache) when I try to access http://www.searchmorph.com/kat/spell.jsp

How embarrassing!
Sorry!
Fixed!


> 
> 
> T
> 
> -----Original Message-----
> From: David Spencer [mailto:dave-lucene-user@tropo.com]
> Sent: Tuesday, September 14, 2004 3:23 PM
> To: Lucene Users List
> Subject: NGramSpeller contribution -- Re: combining open office
> spellchecker with Lucene
> 
> 
> Andrzej Bialecki wrote:
> 
> 
>>David Spencer wrote:
>>
>>
>>>I can/should send the code out. The logic is that for any terms in a 
>>>query that have zero matches, go thru all the terms(!) and calculate 
>>>the Levenshtein string distance, and return the best matches. A more 
>>>intelligent way of doing this is to instead look for terms that also 
>>>match on the 1st "n" (prob 3) chars.
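The brute-force approach described above can be sketched roughly as follows. This is only an illustration: the class and method names (LevenshteinSketch, distance, bestMatches) are hypothetical, the term list is a plain in-memory List rather than a Lucene index, and it does not include the prefix optimization.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper for illustration; not part of Lucene or of the attached source.
final class LevenshteinSketch {

    // Two-row dynamic-programming edit distance (insert, delete, substitute each cost 1).
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // building the first j chars of b from "" takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // reducing the first i chars of a to "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; // swap rows so prev always holds the last completed row
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Scans every term (the expensive part) and returns the closest ones first.
    static List<String> bestMatches(String word, List<String> terms, int max) {
        List<String> sorted = new ArrayList<>(terms);
        sorted.sort(Comparator.comparingInt((String t) -> distance(word, t)));
        return sorted.subList(0, Math.min(max, sorted.size()));
    }
}
```

The cost is one distance computation per term for every misspelled word, which is why restricting the scan to terms sharing a prefix, or using a precomputed lookup index, is attractive.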
>>
>>
>>...or prepare in advance a fast lookup index - split all existing terms 
>>to bi- or trigrams, create a separate lookup index, and then simply for 
>>each term ask a phrase query (phrase = all n-grams from an input term), 
>>with a slop > 0, to get similar existing terms. This should be fast, and 
>>you could provide a "did you mean" function too...
>>
> 
> 
> Based on this mail I wrote a "ngram speller" for Lucene. It runs in 2 
> phases. First you build a "fast lookup index" as mentioned above. Then 
> to correct a word you do a query in this index based on the ngrams in 
> the misspelled word.
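A minimal sketch of the two phases, under loud assumptions: it replaces the Lucene lookup index with an in-memory map, ranks candidates only by how many n-grams they share with the input (no phrase query, slop, or Lucene scoring), and all names in it (NGramLookupSketch, ngrams, add, suggest) are hypothetical rather than taken from the attached source.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory stand-in for the n-gram lookup index; not the NGramSpeller code.
final class NGramLookupSketch {
    private final Map<String, Set<String>> wordsByGram = new HashMap<>();
    private final int minLen;
    private final int maxLen;

    NGramLookupSketch(int minLen, int maxLen) {
        this.minLen = minLen;
        this.maxLen = maxLen;
    }

    // All substrings of length n, in order: ngrams("word", 3) gives [wor, ord].
    static List<String> ngrams(String word, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            out.add(word.substring(i, i + n));
        }
        return out;
    }

    // Phase 1: register a known word under each of its n-grams.
    void add(String word) {
        for (int n = minLen; n <= maxLen; n++) {
            for (String gram : ngrams(word, n)) {
                wordsByGram.computeIfAbsent(gram, k -> new HashSet<>()).add(word);
            }
        }
    }

    // Phase 2: look up the n-grams of the misspelled word and rank known words
    // by the number of distinct n-grams they share with it (ties broken alphabetically).
    List<String> suggest(String misspelled, int max) {
        Map<String, Integer> shared = new HashMap<>();
        for (int n = minLen; n <= maxLen; n++) {
            for (String gram : new HashSet<>(ngrams(misspelled, n))) {
                for (String w : wordsByGram.getOrDefault(gram, Collections.<String>emptySet())) {
                    shared.merge(w, 1, Integer::sum);
                }
            }
        }
        List<String> ranked = new ArrayList<>(shared.keySet());
        ranked.sort((x, y) -> {
            int byCount = Integer.compare(shared.get(y), shared.get(x));
            return byCount != 0 ? byCount : x.compareTo(y);
        });
        return ranked.subList(0, Math.min(max, ranked.size()));
    }
}
```

With n-gram lengths 3 and 4 and an index holding only "recursive", "reverse" and "cursor", suggest("recursixe", 1) returns "recursive", since it shares nine n-grams with the input against three for "cursor" and none for "reverse".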
> 
> Let's see.
> 
> [1] Source is attached and I'd like to contribute it to the sandbox, esp 
> if someone can validate that what it's doing is reasonable and useful.
> 
> [2] Here's a demo page. I built an ngram index for ngrams of length 3 
> and 4 based on the existing index I have of approx 100k 
> javadoc-generated pages. You type in a misspelled word like "recursixe" 
> or whatnot to see what suggestions it returns. Note this is not a normal 
> search index query -- rather this is a test page for spelling corrections.
> 
> http://www.searchmorph.com/kat/spell.jsp
> 
> [3] Here's the javadoc:
> 
> http://www.searchmorph.com/pub/ngramspeller/org/apache/lucene/spell/NGramSpeller.html
> 
> [4] Here's source in HTML:
> 
> http://www.searchmorph.com/pub/ngramspeller/src-html/org/apache/lucene/spell/NGramSpeller.html#line.152
> 
> [5] A few more details:
> 
> Based on a subsequent mail in this thread I set boosts for the words in 
> the ngram index. The background is each word (er..term for a given 
> field) in the orig index is a separate Document in the ngram index. This 
> Doc contains all ngrams (in my test case, like #2 above, of length 3 and 
> 4) of the word. I also set a boost of log(word_freq)/log(num_docs) so 
> that more frequent words will tend to be suggested more often.
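The boost formula can be written out as a small function. This is a sketch under assumptions: word_freq is taken to be the number of documents containing the word and num_docs the total number of documents in the original index, and the class name BoostSketch and the argument checks are hypothetical additions, not something taken from the attached source.

```java
// Hypothetical helper illustrating the boost formula from the mail.
final class BoostSketch {

    // boost = log(word_freq) / log(num_docs)
    // Assumes 1 <= wordFreq <= numDocs, so the result lies between 0 and 1.
    static double boost(int wordFreq, int numDocs) {
        if (wordFreq < 1 || numDocs < 2) {
            // log(0) is undefined and log(1) == 0 would make the divisor zero
            throw new IllegalArgumentException("need wordFreq >= 1 and numDocs >= 2");
        }
        return Math.log(wordFreq) / Math.log(numDocs);
    }
}
```

For example, a word found in 100 of 10,000 documents gets a boost of 0.5, and a word found in every document gets 1.0. A word found in exactly one document gets 0 under this formula, so an implementation may want to treat that case specially.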
> 
> I think in "plain" English then the way a word is suggested as a 
> spelling correction is:
> - frequently occurring words score higher
> - words that share more ngrams with the orig word score higher
> - words that share rare ngrams with the orig word score higher
> 
> [6]
> 
> If people want to vote me in as a committer to the sandbox then I can 
> check this code in - though again, I'd appreciate feedback.
> 
> thx,
>   Dave
> 
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 

