lucene-solr-dev mailing list archives

From Grant Ingersoll <>
Subject Re: Spell checking ?'s
Date Fri, 22 Feb 2008 22:11:31 GMT
Yeah, context can play a role, but that is up to the Analyzer to
determine.  I will open a JIRA issue to address the problem as it
exists now, with a fix that does the analysis before submitting the terms.
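
As a rough illustration of that fix (not the actual SpellCheckerRequestHandler code), the sketch below runs the query through an analyzer first and then spell checks each resulting token, as the pseudocode in the original message quoted below proposes. It is written in Python for brevity. The names `simple_analyzer`, `suggest`, `spell_check_query` and the tiny dictionary are hypothetical stand-ins, and `difflib` merely substitutes for the real spell checker's matching.

```python
import difflib
import re
from typing import Callable, Dict, List

DICTIONARY = ["lucene", "solr", "spell", "checker"]  # toy word list (assumption)


def simple_analyzer(query: str) -> List[str]:
    """Hypothetical stand-in for the spelling field's query analyzer.

    Lowercases the input and keeps runs of letters/digits, allowing
    internal hyphens, so punctuation other than '-' separates tokens.
    """
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", query.lower())


def suggest(token: str, dictionary: List[str], count: int = 3) -> List[str]:
    """Hypothetical stand-in for the spell checker itself."""
    if token in dictionary:
        return []  # token is spelled correctly, nothing to suggest
    return difflib.get_close_matches(token, dictionary, n=count, cutoff=0.6)


def spell_check_query(
    query: str,
    analyzer: Callable[[str], List[str]],
    dictionary: List[str],
) -> Dict[str, List[str]]:
    """Tokenize with the analyzer, then spell check each token."""
    results: Dict[str, List[str]] = {}
    for token in analyzer(query):
        results[token] = suggest(token, dictionary)
    return results


if __name__ == "__main__":
    print(spell_check_query("Solr spel,chekcer", simple_analyzer, DICTIONARY))
    # {'solr': [], 'spel': ['spell'], 'chekcer': ['checker']}
```

Splitting the same query on the space character alone would yield `spel,chekcer` as a single term, which is the behaviour the proposal aims to avoid. Because the analyzer is passed in as a parameter, one that emits word n-grams could be swapped in where context matters.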


On Feb 22, 2008, at 4:03 PM, Sean Timm wrote:

> Sometimes context can play into the correct spelling of a term.  I  
> haven't looked at the 1.3 spell check stuff, but it would be nice to  
> do term n-gramming in order to check the terms in context.
> Since Otis brought up Google, here is an example of putting the term  
> into context.
> -Sean
> Otis Gospodnetic wrote:
>> Haven't used SCRH in a while, but what you are describing sounds  
>> right (thinking about how Google does it) - each word should be  
>> checked separately and we shouldn't assume splitting on  
>> whitespace.  I'm trying to think if there are cases where you'd  
>> want to look at the surrounding terms instead of looking at each  
>> term in isolation.... can't think of anything exciting....maybe  
>> ensure that words with dashes are properly handled.
>> Otis
>> --
>> Sematext -- Lucene - Solr - Nutch
>> ----- Original Message ----
>>> From: Grant Ingersoll <>
>>> To:
>>> Sent: Thursday, February 21, 2008 3:13:20 PM
>>> Subject: Spell checking ?'s
>>> Hi,
>>> I've been looking a bit at the spell checker and the
>>> implementation in the SpellCheckerRequestHandler and I have some
>>> questions.
>>> In looking at the code and the wiki, the SpellChecker seems to
>>> treat multiword queries differently depending on whether
>>> extendedResults is true or not. Is the use case a multiword
>>> query or a single word query? It seems like one would want to
>>> pass the whole query to the spell checker and have it come back
>>> with results for each word, by default. Otherwise, the
>>> application would need to do the tokenization and send each term
>>> one by one to the spell checker. However, the app likely doesn't
>>> have access to the spell check tokenizer, so this is difficult.
>>> Which leads me to the next question, in the extendedResults,
>>> shouldn't it use the Query analyzer for the spellcheck field to
>>> tokenize the terms instead of splitting on the space character?
>>> Would it make sense to, for extendedResults anyway, do the  
>>> following:
>>> Tokenize the query using the query analyzer for the spelling field
>>> for each token
>>>    spell check the token
>>>    add the results
>>> I see that extendedResults is a 1.3 addition, so we would be fine
>>> to change it, if it makes sense.
>>> Perhaps, for back compatibility, we keep the existing way for non
>>> extendedResults. However, it seems like multiword queries should
>>> be split even in the non-extended results, but I am not sure.
>>> How are others using it?
>>> Thanks,
>>> Grant

Grant Ingersoll
Next Training: April 7, 2008 at ApacheCon Europe in Amsterdam
