lucene-java-user mailing list archives

From David Spencer <dave-lucene-u...@tropo.com>
Subject Re: frequent terms - Re: combining open office spellchecker with Lucene
Date Sat, 11 Sep 2004 03:40:54 GMT
Doug Cutting wrote:

> David Spencer wrote:
> 
>> Doug Cutting wrote:
>>
>>> And one should not try correction at all for terms which occur in a 
>>> large proportion of the collection.
>>
>>
>>
>> I keep thinking over this one and I don't understand it. If a user 
>> misspells a word and the "did you mean" spelling correction algorithm 
>> determines that a frequent term is a good suggestion, why not suggest 
>> it? The very fact that it's common could mean that it's more likely 
>> that the user wanted this word (well, the heuristic here is that users 
>> frequently search for frequent terms, which is probably wrong, but 
>> anyway..).
> 
> 
> I think you misunderstood me.  What I meant to say was that if the term 
> the user enters is very common then spell correction may be skipped. 
> Very common words which are similar to the term the user entered should 
> of course be shown.  But if the user's term is very common one need not 
> even attempt to find similarly-spelled words.  Is that any better?

Yes, sure, thx, I understand now - but maybe not. The context I had in 
mind was something like this:

[1] The user enters a query like:
     recursize descent parser

[2] The search code parses this and sees that the 1st word is not a term 
in the index, but the next 2 are. So it leaves the last 2 terms 
("descent" and "parser") alone and suggests alternatives to 
"recursize". Thus if any term is in the index, regardless of frequency, 
it is left as-is.
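
A minimal sketch of steps [1] and [2], to make the flow concrete. It 
assumes the index vocabulary is available as a plain in-memory set; the 
class and method names (MissingTermFinder, termsToCorrect) are 
hypothetical and not part of Lucene, and the suggestion step itself is 
left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class MissingTermFinder {

    // Returns the query terms that do not occur in the index, in query order.
    // Terms that are in the index are left as-is, regardless of frequency.
    static List<String> termsToCorrect(String query, Set<String> indexTerms) {
        List<String> missing = new ArrayList<>();
        for (String term : query.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!term.isEmpty() && !indexTerms.contains(term)) {
                missing.add(term);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Stand-in for the index vocabulary (an assumption for illustration).
        Set<String> indexTerms =
                new HashSet<>(Arrays.asList("recursive", "descent", "parser"));
        // prints [recursize]
        System.out.println(termsToCorrect("recursize descent parser", indexTerms));
    }
}
```

Only "recursize" would be handed to the "did you mean" code here; 
"descent" and "parser" pass through untouched.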

I guess you're saying that even if the user enters a term that appears 
in the index, and thus is sort of spelled correctly (as it exists in 
some doc), we use the heuristic that any sufficiently large doc 
collection will have tons of misspellings. So we assume that rare terms 
in the query might be misspelled (i.e. not what the user intended), and 
we suggest alternatives to these words too (in addition to the words in 
the query that are not in the index at all).
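
A rough sketch of that reading, under stated assumptions: document 
frequencies come from a hypothetical in-memory map (in Lucene they would 
presumably come from something like IndexReader.docFreq(Term) and 
numDocs()), and the cutoff for "rare" is an arbitrary fraction of the 
collection that nobody in this thread has proposed. The names TermTriage, 
Action and rareFraction are made up for illustration.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class TermTriage {

    enum Action { SUGGEST_MISSING, SUGGEST_RARE, KEEP }

    // Decides what to do with one query term based on its document frequency.
    static Action classify(String term, Map<String, Integer> docFreq,
                           int numDocs, double rareFraction) {
        int df = docFreq.getOrDefault(term, 0);
        if (df == 0) {
            return Action.SUGGEST_MISSING;   // not in the index at all
        }
        if ((double) df / numDocs < rareFraction) {
            return Action.SUGGEST_RARE;      // in the index, but possibly a misspelling
        }
        return Action.KEEP;                  // common enough: skip correction
    }

    // Classifies every term of a whitespace-separated query, in query order.
    static Map<String, Action> triage(String query, Map<String, Integer> docFreq,
                                      int numDocs, double rareFraction) {
        Map<String, Action> result = new LinkedHashMap<>();
        for (String term : query.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!term.isEmpty()) {
                result.put(term, classify(term, docFreq, numDocs, rareFraction));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up frequencies for a made-up collection of 1000 docs.
        Map<String, Integer> docFreq = new HashMap<>();
        docFreq.put("recursive", 120);
        docFreq.put("descent", 80);
        docFreq.put("parser", 200);
        docFreq.put("parsr", 2);   // a misspelling that made it into the index

        System.out.println(triage("recursize descent parsr", docFreq, 1000, 0.01));
    }
}
```

With these made-up numbers, "recursize" is flagged as missing, "parsr" 
as rare, and "descent" is left alone. A single cutoff is only one way to 
do it; the right value surely depends on the collection.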


> 
> Doug
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 



