lucene-dev mailing list archives

From Ahmed El-dawy <aseld...@gmail.com>
Subject Re: Reordering search results
Date Thu, 06 Oct 2005 12:28:07 GMT
Thanks for your help.
I used PhraseQuery to boost close terms. I have an idea for stop
words, but I don't know whether it has any drawbacks. I can index a dummy
Token in place of every stop word. This token will never be searched,
but it will be counted as a Token and will leave a gap between the
surrounding words. Does this solution have any drawbacks?
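
To make the idea concrete, here is a minimal sketch in plain Java that only simulates token positions; it does not use Lucene's analysis API. The class name, the stop list, and the `_stop_` placeholder are hypothetical and chosen for illustration. It compares dropping stop words with replacing them by a placeholder, and reports how far apart "welcome" and "there" end up in each case.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java simulation of token positions; NOT Lucene's analysis API.
class StopWordPlaceholderSketch {
    // Hypothetical stop list and placeholder token (assumptions for this sketch).
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("all", "the", "a"));
    static final String PLACEHOLDER = "_stop_";

    // Splits on whitespace and lower-cases; a stand-in for a real tokenizer.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().toLowerCase().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Removes stop words entirely, so the surrounding terms become adjacent.
    static List<String> dropStopWords(String text) {
        List<String> out = new ArrayList<>();
        for (String t : tokenize(text)) {
            if (!STOP_WORDS.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    // Replaces each stop word with the placeholder, so every position is kept.
    static List<String> replaceStopWords(String text) {
        List<String> out = new ArrayList<>();
        for (String t : tokenize(text)) {
            out.add(STOP_WORDS.contains(t) ? PLACEHOLDER : t);
        }
        return out;
    }

    // Distance in positions between the first occurrences of a and b, or -1 if either is missing.
    static int distance(List<String> tokens, String a, String b) {
        int i = tokens.indexOf(a);
        int j = tokens.indexOf(b);
        return (i < 0 || j < 0) ? -1 : Math.abs(j - i);
    }

    public static void main(String[] args) {
        String doc = "welcome all there";
        System.out.println(dropStopWords(doc));    // [welcome, there]
        System.out.println(replaceStopWords(doc)); // [welcome, _stop_, there]
        System.out.println(distance(dropStopWords(doc), "welcome", "there"));    // 1
        System.out.println(distance(replaceStopWords(doc), "welcome", "there")); // 2
    }
}
```

With the placeholder in place, "welcome all there" keeps a gap of one position between the two terms, so a proximity-sensitive query could tell it apart from "welcome there". Since the placeholder would be an ordinary indexed term, it would presumably add an entry for every stop word occurrence, which is one cost to weigh.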


On 10/3/05, Joaquin Delgado <joaquin.delgado@oracle.com> wrote:
> Chris, you may consider using a modified version of the Nutch analysis
> (http://lucene.apache.org/nutch/apidocs/org/apache/nutch/analysis/package-summary.html)
> which has a very slick treatment of stopwords. Please refer to chapter
> 4, page 145 of "Lucene in Action", written by Erik and Otis, for some
> details about the Nutch implementation.
>
> -- J.D.
>
> Erik Hatcher wrote:
>
> >
> > On Oct 3, 2005, at 4:56 AM, Chris Lamprecht wrote:
> >
> >>> 1- Words in a Document that are closer to the original search terms have
> >>> a larger Score. For example, if I was searching for "wellcome",
> >>> Document("wellcome") must be better than Document("welcome")
> >>>
> >>
> >> I'm just "thinking out loud" here, but some ideas that come to mind
> >> are:  Index both the original text (with spelling errors), and the
> >> spelling-corrected text.  When you search, search on the corrected
> >> text and, in a non-required query clause, on the uncorrected text,
> >> maybe boosted down a bit.  This way, if the spelling
> >> was correct, it will match both the original term and the corrected
> >> term (since they're the same), but a document with a misspelling would
> >> match only the corrected term.  You'll have to experiment with boosts
> >> and relevance/rankings here.
> >>
> >> Another idea is, if you know the number of misspellings made at
> >> indexing time (it seems like you do), then boost documents based on
> >> the number of spelling errors -- higher boost factor for fewer errors.
> >
> >
> > Another tip is that score is based on term frequency - so when
> > tokenizing correct spellings, add multiple copies of the correctly
> > spelled words to weight towards them.
> >
> >>> 2- Documents that have search terms close to each other have a larger
> >>> Score. For example, if I was searching for "welcome there",
> >>> Document("welcome there") must be better than Document("welcome all
> >>> there"). Note that "all" is a stop word in my implementation.
> >>>
> >>
> >> PhraseQuery with a high slop factor (Integer.MAX_VALUE works) scores
> >> higher for terms that are closer together.  You can construct the
> >> PhraseQuery yourself (programmatically), or QueryParser takes it as:
> >>
> >> "welcome there"~99999
> >>
> >> (with the quotes).  99999 is the slop factor, which means accepting
> >> documents where "welcome" is within 99999 positions of "there".
> >
> >
> > The issue is that "all" is a stop word, though.  The StopFilter does
> > not leave a hole when stop words are removed, so indexing "welcome
> > all there" is exactly the same as indexing "welcome there" as far as
> > the index is concerned.  I started to address this situation in the
> > 1.4.x Lucene releases but it introduced a backward incompatible issue
> > so we reverted.  Care must be taken on the Query side of things -
> > PhraseQuery did not deal with anything but term position increments
> > of 1, but this has been addressed in the latest codebase (in
> > Subversion).
> >
> > I built a PositionalStopFilter for the Analysis chapter of "Lucene in
> > Action" and discussed these details there - it is available in the code
> > .zip at http://www.lucenebook.com
> >
> >     Erik
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-dev-help@lucene.apache.org
> >
>


--
regards,
Ahmed Saad


