From: Grant Ingersoll
To: solr-dev@lucene.apache.org
Subject: Re: Spell checking ?'s
Date: Fri, 22 Feb 2008 17:11:31 -0500

Yeah, context can
play a role, but that is up to the Analyzer to determine. I will open a
JIRA issue to address the problem as it exists now, with a fix to do the
analysis before submitting the terms.

-Grant

On Feb 22, 2008, at 4:03 PM, Sean Timm wrote:

> Sometimes context can play into the correct spelling of a term. I
> haven't looked at the 1.3 spell check stuff, but it would be nice to
> do term n-gramming in order to check the terms in context.
>
> Since Otis brought up Google, here is an example of putting the term
> into context:
> http://www.google.com/search?q=choudhury
> http://www.google.com/search?q=abdur+choudhury
>
> -Sean
>
> Otis Gospodnetic wrote:
>> Haven't used SCRH in a while, but what you are describing sounds
>> right (thinking about how Google does it): each word should be
>> checked separately, and we shouldn't assume splitting on
>> whitespace. I'm trying to think if there are cases where you'd
>> want to look at the surrounding terms instead of looking at each
>> term in isolation... I can't think of anything exciting... maybe
>> ensuring that words with dashes are properly handled.
>>
>> Otis
>> --
>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>
>> ----- Original Message ----
>>
>>> From: Grant Ingersoll
>>> To: solr-dev@lucene.apache.org
>>> Sent: Thursday, February 21, 2008 3:13:20 PM
>>> Subject: Spell checking ?'s
>>>
>>> Hi,
>>>
>>> I've been looking a bit at the spell checker and the
>>> implementation in SpellCheckerRequestHandler, and I have some
>>> questions.
>>>
>>> In looking at the code and the wiki, the SpellChecker seems to
>>> treat multiword queries differently depending on whether
>>> extendedResults is true or not. Is the use case a multiword
>>> query or a single-word query? It seems like one would want to
>>> pass the whole query to the spell checker and have it come back
>>> with results for each word, by default.
>>> Otherwise, the
>>> application would need to do the tokenization and send each term
>>> one by one to the spell checker. However, the app likely doesn't
>>> have access to the spell check tokenizer, so this is difficult.
>>>
>>> Which leads me to the next question: in the extendedResults case,
>>> shouldn't it use the Query analyzer for the spellcheck field to
>>> tokenize the terms, instead of splitting on the space character?
>>>
>>> Would it make sense, for extendedResults anyway, to do the
>>> following:
>>>   1. Tokenize the query using the query analyzer for the
>>>      spelling field.
>>>   2. For each token, spell check the token and add the results.
>>>
>>> I see that extendedResults is a 1.3 addition, so we would be fine
>>> to change it, if it makes sense.
>>>
>>> Perhaps, for back compatibility, we keep the existing behavior
>>> for non-extendedResults. However, it seems like multiword queries
>>> should be split even in the non-extended results, but I am not
>>> sure. How are others using it?
>>>
>>> Thanks,
>>> Grant

--------------------------
Grant Ingersoll
http://www.lucenebootcamp.com
Next Training: April 7, 2008 at ApacheCon Europe in Amsterdam

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
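For what it's worth, the flow Grant proposes (pass the whole query through the field's analyzer, then spell check each resulting token and collect results per word) can be sketched roughly like this. This is a self-contained illustration, not the actual Solr code: the lowercase/split tokenizer stands in for the field's configured query Analyzer, and the toy edit-distance lookup over a small word set stands in for the real index-backed SpellChecker.

```java
import java.util.*;

public class SpellCheckSketch {

    // Stand-in for the spellcheck field's query analyzer: lowercase and
    // split on non-letter runs. In Solr this step would run the query
    // through the field's configured Analyzer instead.
    static List<String> analyze(String query) {
        List<String> tokens = new ArrayList<>();
        for (String t : query.toLowerCase().split("[^a-z]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Classic Levenshtein edit distance, used here as a toy similarity
    // measure for ranking candidate corrections.
    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    // The proposed flow: analyze once, then check each token. Each token
    // maps to itself if known, else to the closest word within distance 2.
    static Map<String, String> checkAll(String query, Set<String> dictionary) {
        Map<String, String> results = new LinkedHashMap<>();
        for (String token : analyze(query)) {
            if (dictionary.contains(token)) {
                results.put(token, token);
                continue;
            }
            String best = token;
            int bestDist = 3; // only accept suggestions within distance 2
            for (String word : dictionary) {
                int dist = editDistance(token, word);
                if (dist < bestDist) { bestDist = dist; best = word; }
            }
            results.put(token, best);
        }
        return results;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(
            Arrays.asList("spell", "checker", "analyzer", "query"));
        // The whole query goes in; corrections come back per token,
        // so the caller never has to tokenize on its own.
        System.out.println(checkAll("Spel checkr query", dictionary));
        // -> {spel=spell, checkr=checker, query=query}
    }
}
```

The point of the sketch is the shape of the API: because analysis happens inside the checker, the application never needs access to the spell check tokenizer, which is exactly the difficulty Grant raises about per-term submission.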