lucene-java-user mailing list archives

From Avi Rosenschein <arosensch...@gmail.com>
Subject Re: boosts for unstemmed matches (was Re: If you could have one feature in Lucene...)
Date Wed, 24 Feb 2010 21:38:35 GMT
On Wed, Feb 24, 2010 at 11:20 PM, Aaron Lav <asl2@pobox.com> wrote:

> On Wed, Feb 24, 2010 at 10:18:27PM +0200, Avi Rosenschein wrote:
> > On Wed, Feb 24, 2010 at 3:42 PM, Grant Ingersoll <gsingers@apache.org> wrote:
> >
> > > What would it be?
> > >
> >
> > For scoring to take into account the non-analyzed token stream.
> >
> > That is, if a field is analyzed (stemmed, lowercased, maybe even stop words
> > removed), that is fine for indexing. But tokens in the query matching the
> > original form could still get a higher score than those that only match when
> > analyzed.
>
> You can get some of that effect by indexing stemmed and unstemmed
> forms, and letting IDF boost unstemmed results.  (I picked this
> idea up from http://lingpipe-blog.com/2007/03/21/to-stem-or-not-to-stem/)
>

This is not quite the same (either in relevance or efficiency). I would like
the infrastructure for this to be built into Lucene, so that queries and
scorers could take advantage of it.
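
To make the request concrete, here is a minimal sketch in plain Python, not Lucene code. It assumes a toy index that stores each token once under its stemmed form but keeps the original surface form in the posting, so the scorer can reward an unstemmed match. The names toy_stem, ToyIndex, and exact_boost are hypothetical, and the stemmer is deliberately crude.

```python
from collections import defaultdict


def toy_stem(token):
    """Hypothetical, deliberately crude stemmer: lowercase, strip one suffix."""
    t = token.lower()
    for suffix in ("ing", "ed", "s"):
        if t.endswith(suffix) and len(t) - len(suffix) >= 3:
            return t[: -len(suffix)]
    return t


class ToyIndex:
    """Toy inverted index keyed by stem; each posting keeps the surface form."""

    def __init__(self):
        # stem -> list of (doc_id, position, original surface form)
        self.postings = defaultdict(list)

    def add(self, doc_id, text):
        for pos, surface in enumerate(text.split()):
            self.postings[toy_stem(surface)].append((doc_id, pos, surface))

    def search(self, query, exact_boost=2.0):
        """Score 1.0 per stemmed match, exact_boost when the surface form matches."""
        scores = defaultdict(float)
        for q in query.split():
            for doc_id, _pos, surface in self.postings.get(toy_stem(q), []):
                scores[doc_id] += exact_boost if surface == q else 1.0
        # Highest score first; ties broken by doc_id for a stable order.
        return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


index = ToyIndex()
index.add(1, "boosting scores")
index.add(2, "boost scores")
print(index.search("boost"))   # doc 2 matches the original form, doc 1 only the stem
print(index.search("boosts"))  # neither doc contains "boosts" verbatim
```

Both documents are found through the single stemmed entry, and only the scorer distinguishes them, which is the behaviour that a second, unstemmed field plus IDF can only approximate.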


> > Also, this would maybe allow a flexible, run-time, decision of what
> > analyzers to include. For example, I might want stemming turned on for
> > normal search, but not for a PhraseQuery.
>
> That's harder - different field names for the different analyses might
> work, but not for run-time decisions.  I think the way Sun's Minion does
> it is morphologically-based query expansion (see
> http://blogs.sun.com/searchguy/entry/lightweight_morphology_vs_stemming),
> and you might be able to implement that via query rewriting.
>

Again, rather than forcing me to store a separate field for every possible
type of query I might want to build, Lucene should be able to efficiently
store the original information in a form conducive to using at query time.
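
As a rough illustration of the run-time choice, the sketch below (plain Python, not Lucene code) assumes each position is stored once as a (surface form, stem) pair, and the query decides which view to match against. The names toy_stem, index_document, and contains_phrase are hypothetical.

```python
def toy_stem(token):
    """Hypothetical, deliberately crude stemmer: lowercase, strip one suffix."""
    t = token.lower()
    for suffix in ("ing", "ed", "s"):
        if t.endswith(suffix) and len(t) - len(suffix) >= 3:
            return t[: -len(suffix)]
    return t


def index_document(text):
    """Store every position once, as a (lowercased surface form, stem) pair."""
    return [(word.lower(), toy_stem(word)) for word in text.split()]


def contains_phrase(doc, phrase, use_stemming):
    """Match a phrase against the stemmed or the surface view, chosen per query."""
    view = 1 if use_stemming else 0
    doc_terms = [entry[view] for entry in doc]
    wanted = [toy_stem(w) if use_stemming else w.lower() for w in phrase.split()]
    n = len(wanted)
    if n == 0:
        return False
    # Slide a window of length n over the document positions.
    return any(doc_terms[i:i + n] == wanted for i in range(len(doc_terms) - n + 1))


doc = index_document("Boosted terms rank higher")
print(contains_phrase(doc, "boosting term", use_stemming=True))   # stemmed view
print(contains_phrase(doc, "boosting term", use_stemming=False))  # surface view
```

The same stored positions serve both a stemmed "normal" search and an unstemmed phrase match, with no extra field per query type.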
