From: Grant Ingersoll
To: solr-dev@lucene.apache.org
Subject: Re: Spell checking ?'s
Date: Fri, 22 Feb 2008 17:11:31 -0500

Yeah, context can
play a role, but that is up to the Analyzer to determine. I will open a
JIRA issue to address the problem as it exists now, with a fix to do the
analysis before submitting the terms.

-Grant

On Feb 22, 2008, at 4:03 PM, Sean Timm wrote:

> Sometimes context can play into the correct spelling of a term. I
> haven't looked at the 1.3 spell check stuff, but it would be nice to
> do term n-gramming in order to check the terms in context.
>
> Since Otis brought up Google, here is an example of putting the term
> into context:
> http://www.google.com/search?q=choudhury
> http://www.google.com/search?q=abdur+choudhury
>
> -Sean
>
> Otis Gospodnetic wrote:
>> Haven't used SCRH in a while, but what you are describing sounds
>> right (thinking about how Google does it): each word should be
>> checked separately, and we shouldn't assume splitting on
>> whitespace. I'm trying to think if there are cases where you'd
>> want to look at the surrounding terms instead of looking at each
>> term in isolation... I can't think of anything exciting... maybe
>> ensuring that words with dashes are properly handled.
>>
>> Otis
>> --
>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>
>> ----- Original Message ----
>>
>>> From: Grant Ingersoll
>>> To: solr-dev@lucene.apache.org
>>> Sent: Thursday, February 21, 2008 3:13:20 PM
>>> Subject: Spell checking ?'s
>>>
>>> Hi,
>>>
>>> I've been looking a bit at the spell checker and the
>>> implementation in SpellCheckerRequestHandler, and I have some
>>> questions.
>>>
>>> In looking at the code and the wiki, the SpellChecker seems to
>>> treat multiword queries differently depending on whether
>>> extendedResults is true or not. Is the use case a multiword
>>> query or a single-word query? It seems like one would want to
>>> pass the whole query to the spell checker and have it come back
>>> with results for each word, by default.
>>> Otherwise, the
>>> application would need to do the tokenization and send each term
>>> one by one to the spell checker. However, the app likely doesn't
>>> have access to the spell check tokenizer, so this is difficult.
>>>
>>> Which leads me to the next question: in the extendedResults case,
>>> shouldn't it use the Query analyzer for the spellcheck field to
>>> tokenize the terms, instead of splitting on the space character?
>>>
>>> Would it make sense, for extendedResults anyway, to do the
>>> following:
>>>   1. Tokenize the query using the query analyzer for the
>>>      spelling field.
>>>   2. For each token, spell check the token and add the results.
>>>
>>> I see that extendedResults is a 1.3 addition, so we would be fine
>>> to change it, if it makes sense.
>>>
>>> Perhaps, for back compatibility, we keep the existing behavior
>>> for non-extendedResults. However, it seems like multiword queries
>>> should be split even in the non-extended results, but I am not
>>> sure. How are others using it?
>>>
>>> Thanks,
>>> Grant

--------------------------
Grant Ingersoll
http://www.lucenebootcamp.com
Next Training: April 7, 2008 at ApacheCon Europe in Amsterdam

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
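For what it's worth, the flow Grant proposes (pass the whole query through the field's analyzer, then spell check each resulting token and collect results per word) can be sketched roughly like this. This is a self-contained illustration, not the actual Solr code: the lowercase/split tokenizer stands in for the field's configured query Analyzer, and the toy edit-distance lookup over a small word set stands in for the real index-backed SpellChecker.

```java
import java.util.*;

public class SpellCheckSketch {

    // Stand-in for the spellcheck field's query analyzer: lowercase and
    // split on non-letter runs. In Solr this step would run the query
    // through the field's configured Analyzer instead.
    static List<String> analyze(String query) {
        List<String> tokens = new ArrayList<>();
        for (String t : query.toLowerCase().split("[^a-z]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Classic Levenshtein edit distance, used here as a toy similarity
    // measure for ranking candidate corrections.
    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    // The proposed flow: analyze once, then check each token. Each token
    // maps to itself if known, else to the closest word within distance 2.
    static Map<String, String> checkAll(String query, Set<String> dictionary) {
        Map<String, String> results = new LinkedHashMap<>();
        for (String token : analyze(query)) {
            if (dictionary.contains(token)) {
                results.put(token, token);
                continue;
            }
            String best = token;
            int bestDist = 3; // only accept suggestions within distance 2
            for (String word : dictionary) {
                int dist = editDistance(token, word);
                if (dist < bestDist) { bestDist = dist; best = word; }
            }
            results.put(token, best);
        }
        return results;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(
            Arrays.asList("spell", "checker", "analyzer", "query"));
        // The whole query goes in; corrections come back per token,
        // so the caller never has to tokenize on its own.
        System.out.println(checkAll("Spel checkr query", dictionary));
        // -> {spel=spell, checkr=checker, query=query}
    }
}
```

The point of the sketch is the shape of the API: because analysis happens inside the checker, the application never needs access to the spell check tokenizer, which is exactly the difficulty Grant raises about per-term submission.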