lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Doug Cutting <>
Subject Re: frequent terms - Re: combining open office spellchecker with Lucene
Date Tue, 14 Sep 2004 21:35:20 GMT
David Spencer wrote:
> [1] The user enters a query like:
>     recursize descent parser
> [2] The search code parses this and sees that the 1st word is not a term 
> in the index, but the next 2 are. So it ignores the last 2 terms 
> ("recursive" and "descent") and suggests alternatives to 
> "recursize"...thus if any term is in the index, regardless of frequency, 
>  it is left as-is.
> I guess you're saying that, if the user enters a term that appears in 
> the index and thus is sort of spelled correctly ( as it exists in some 
> doc), then we use the heuristic that any sufficiently large doc 
> collection will have tons of misspellings, so we assume that rare terms 
> in the query might be misspelled (i.e. not what the user intended) and 
> we suggest alternativies to these words too (in addition to the words in 
> the query that are not in the index at all).


If the user enters "a recursize purser", then: "a", which is in, say, 
 >50% of the documents, is probably spelled correctly and "recursize", 
which is in zero documents, is probably mispelled.  But what about 
"purser"?  If we run the spell check algorithm on "purser" and generate 
"parser", should we show it to the user?  If "purser" occurs in 1% of 
documents and "parser" occurs in 5%, then we probably should, since 
"parser" is a more common word than "purser".  But if "parser" only 
occurs in 1% of the documents and purser occurs in 5%, then we probably 
shouldn't bother suggesting "parser".

If you wanted to get really fancy then you could check how frequently 
combinations of query terms occur, i.e., does "purser" or "parser" occur 
more frequently near "descent".  But that gets expensive.


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message