lucene-dev mailing list archives

From "Chuck Williams" <ch...@manawiz.com>
Subject RE: Contribution: better multi-field searching
Date Wed, 13 Oct 2004 19:32:49 GMT
Doug,

That approach does not work.  I could not find an approach that would
work with the built-in classes, although of course there might be one.
The problem has two components:  coord, and the fact that a BooleanQuery
sums its clause scores to compute the final score.  The latter is not
easily overridden.  Specifically,

  title:(albino elephant)^4 description:(albino elephant)

still has the problem that a result with albino in the title and albino
in the description gets the same score as a result with albino in the
title and elephant in the description (assuming tf and idf scores are
not relevant, and if they are they may make the problem worse rather
than better).

In fact the query above, without overriding coord, is equivalent to

  (title:albino title:elephant description:albino description:elephant)

which makes the root cause of the problem apparent (no distinction
between the same term in different fields vs. different terms in
different fields).  The coord override improves the score for the case
of single fields that contain both terms, but still does not
differentiate between the cases of same term in different fields and
different terms in different fields.
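
To make the tie concrete, here is a minimal toy model in plain Java. It is not Lucene code: it assumes every matching clause contributes exactly its field boost (tf, idf, norms and query normalization are ignored, per the assumption above), and the class name, method names and document maps are hypothetical.

```java
import java.util.Map;
import java.util.Set;

// Toy scoring model for illustration only; this is not Lucene code.
class SumScoringDemo {
    static final String[] FIELDS = {"title", "description"};
    static final double[] BOOSTS = {4.0, 1.0};   // title^4, description^1
    static final String[] TERMS = {"albino", "elephant"};

    // A document is modeled as a map from field name to the terms it contains.
    static final Map<String, Set<String>> ALBINO_TWICE = Map.of(
        "title", Set.of("albino"), "description", Set.of("albino"));
    static final Map<String, Set<String>> ALBINO_ELEPHANT = Map.of(
        "title", Set.of("albino"), "description", Set.of("elephant"));

    static boolean has(Map<String, Set<String>> doc, String field, String term) {
        return doc.getOrDefault(field, Set.of()).contains(term);
    }

    // Flat form: four term clauses (title clauses carry the boost), summed,
    // then multiplied by coord = matched clauses / total clauses.
    static double flatScore(Map<String, Set<String>> doc) {
        double sum = 0.0;
        int matched = 0;
        for (int f = 0; f < FIELDS.length; f++) {
            for (String term : TERMS) {
                if (has(doc, FIELDS[f], term)) {
                    sum += BOOSTS[f];
                    matched++;
                }
            }
        }
        int clauses = FIELDS.length * TERMS.length;
        return sum * matched / clauses;
    }

    // Nested form: one inner query per field, each with its own coord,
    // summed by the outer query, which applies coord over the inner queries.
    static double nestedScore(Map<String, Set<String>> doc) {
        double sum = 0.0;
        int matchedFields = 0;
        for (int f = 0; f < FIELDS.length; f++) {
            int overlap = 0;
            for (String term : TERMS) {
                if (has(doc, FIELDS[f], term)) overlap++;
            }
            if (overlap > 0) {
                matchedFields++;
                double innerCoord = (double) overlap / TERMS.length;
                sum += BOOSTS[f] * innerCoord * overlap;
            }
        }
        return sum * matchedFields / FIELDS.length;
    }

    public static void main(String[] args) {
        System.out.println("flat,   albino twice:    " + flatScore(ALBINO_TWICE));
        System.out.println("flat,   albino elephant: " + flatScore(ALBINO_ELEPHANT));
        System.out.println("nested, albino twice:    " + nestedScore(ALBINO_TWICE));
        System.out.println("nested, albino elephant: " + nestedScore(ALBINO_ELEPHANT));
    }
}
```

In this model both documents get the same score under either form, because summing clause scores only counts how many boosted clauses matched, not which terms they cover.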

FYI, MaxDisjunctionQuery has made an enormous improvement in the quality
of my query results, and I have strong reason to believe the same would
be true in most other domains (more on that coming in the idf^2
discussion).  In terms of the albino elephant example, the query above
was putting all the albino animals except elephants above the albino
elephants, while the query with an outer BooleanQuery and inner
MaxDisjunctionQuery clauses

    ( (title:albino^4 | description:albino)~0.1
      (title:elephant^4 | description:elephant)~0.1
    )

properly puts the albino elephants on top.
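
A rough sketch of why this ordering comes out right, again as a toy model in plain Java rather than Lucene code. It assumes that ~0.1 is a tie-breaker multiplier: each per-term disjunction scores as the maximum of its field clauses plus 0.1 times the remaining clause scores. That reading, the class and method names, and the document maps are assumptions made for illustration; tf, idf and norms are ignored as before.

```java
import java.util.Map;
import java.util.Set;

// Toy model of an outer sum over per-term max-disjunctions; not Lucene code.
class MaxDisjunctionDemo {
    static final String[] FIELDS = {"title", "description"};
    static final double[] BOOSTS = {4.0, 1.0};   // title^4, description^1
    static final String[] TERMS = {"albino", "elephant"};
    static final double TIE_BREAKER = 0.1;       // assumed meaning of ~0.1

    static final Map<String, Set<String>> ALBINO_TWICE = Map.of(
        "title", Set.of("albino"), "description", Set.of("albino"));
    static final Map<String, Set<String>> ALBINO_ELEPHANT = Map.of(
        "title", Set.of("albino"), "description", Set.of("elephant"));

    static double score(Map<String, Set<String>> doc) {
        double sum = 0.0;
        int matchedTerms = 0;
        for (String term : TERMS) {
            double max = 0.0;
            double total = 0.0;
            for (int f = 0; f < FIELDS.length; f++) {
                if (doc.getOrDefault(FIELDS[f], Set.of()).contains(term)) {
                    total += BOOSTS[f];
                    max = Math.max(max, BOOSTS[f]);
                }
            }
            if (max > 0.0) {
                matchedTerms++;
                // Best field wins; other fields only add a small tie-breaker.
                sum += max + TIE_BREAKER * (total - max);
            }
        }
        // Outer coord now counts distinct query terms matched, not clauses.
        return sum * matchedTerms / TERMS.length;
    }

    public static void main(String[] args) {
        System.out.println("albino twice:    " + score(ALBINO_TWICE));
        System.out.println("albino elephant: " + score(ALBINO_ELEPHANT));
    }
}
```

Because each term contributes at most its best field score plus a small tie-breaker, a second occurrence of albino adds little, while matching elephant adds a full clause and raises the outer coord, so the albino elephant document ranks first in this model.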

Chuck


> -----Original Message-----
> From: Doug Cutting [mailto:cutting@apache.org]
> Sent: Wednesday, October 13, 2004 9:57 AM
> To: Lucene Developers List
> Subject: Re: Contribution: better multi-field searching
> 
> Chuck Williams wrote:
> > The issue is this.  Imagine you have two fields, title and document,
> > both of which you want to search with simple queries like:  albino
> > elephant.  There are two general approaches, either a) create a combined
> > field that concatenates the two individual fields, or b) expand the
> > simple query into a BooleanQuery that searches for each term in both
> > fields.
> >
> > With approach a), you lose the flexibility to set separate boost factors
> > on the individual fields.  I wanted title to be much more important than
> > description for ranking results, and wanted to control this explicitly,
> > as length norm was not always doing the right thing; e.g., descriptions
> > are not always long.
> >
> > With approach b) you run into another problem.  Suppose the example
> > query is expanded into (title:albino description:albino title:elephant
> > description:elephant).  Then, assuming tf/idf doesn't affect ranking, a
> > document with albino in both title and description will score the same
> > as a document with albino in title and elephant in description.  The
> > latter document for most applications is much better since it matches
> > both query terms.  If albino is the more important term according to
> > idf, then the less desirable documents (albino in both fields) will rank
> > consistently ahead of the albino elephants (which is what was happening
> > to me, yielding horrible results).
> 
> Another way to handle this would be to generate a query like:
> 
>    title:(albino elephant) description:(albino elephant)
> 
> In this case the coord factor would boost titles and descriptions which
> contained both terms.  You may or may not want to disable the coord
> factor for the outer query, which can be done with:
> 
> BooleanQuery title = new BooleanQuery();
> title.add(new TermQuery(new Term("title", "albino")), false, false);
> title.add(new TermQuery(new Term("title", "elephant")), false, false);
> 
> BooleanQuery desc = new BooleanQuery();
> desc.add(new TermQuery(new Term("description", "albino")), false, false);
> desc.add(new TermQuery(new Term("description", "elephant")), false, false);
> 
> // Disable the coord factor for the outer query only.
> BooleanQuery outer = new BooleanQuery() {
>    public Similarity getSimilarity(Searcher searcher) {
>      return new DefaultSimilarity() {
>        public float coord(int overlap, int maxOverlap) { return 1.0f; }
>      };
>    }
> };
> outer.add(title, false, false);
> outer.add(desc, false, false);
> 
> In general, doesn't coord() handle this situation?
> 
> Also, you can separately boost title and desc here, if you like:
> 
>    title:(albino elephant)^4.0 description:(albino elephant)
> 
> or
> 
> title.setBoost(4.0f);
> 
> 
> 
> Doug
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

