lucene-dev mailing list archives

From Ype Kingma <ykin...@xs4all.nl>
Subject Re: User documentation for scoring
Date Mon, 24 Feb 2003 18:28:35 GMT
Terry, Otis, Clemens,

One more try to correct the lucene-dev email address.
Monday evening blues I guess...


>Ype,
>
>I couldn't find/open any attachment.  Would you try to send it to me
>directly?  I'd very much like to read and help revise the document.

O well, I changed email program and it _said_ it would attach.
Anyway, here is the html, it's not very clean, but it displays
nicely here.

Sorry for the teaser,
Ype



<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
<html>
<head>
   <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
   <meta name="GENERATOR" content="Mozilla/4.79 (Macintosh; U; PPC) [Netscape]">
   <title>lcnscoring.html</title>
</head>
<body>

<h1>
Scoring in Lucene</h1>

<h2>
Introduction</h2>
This document explores the scoring capabilities of the Lucene search engine. The
target audience is primarily users of Lucene. The intention is
to help these users understand the default document scores computed by
Lucene for their queries, and how the default scoring mechanism can be
adapted.
<p>This document describes the Lucene scoring mechanism as of Lucene 1.3dev. This
interface is available as the Java class Similarity in the package org.apache.lucene.search
from the Lucene web site http://jakarta.apache.org/lucene, revision 1.2,
29 Jan 2003.
<h2>
Basic operation of Lucene</h2>
In Lucene, a document consists of fields, and a field consists of terms.
To execute queries, Lucene analyzes and indexes documents.
<h3>
Indexing and querying</h3>
During indexing, an analyzer software component extracts terms from each
document. The analyzer normally splits the documents into fields and words
and removes stop words. For each field these words are then stored in a
Lucene index as terms.
<p>Once the index is ready it can be used for querying. A query consists
of terms and phrases and results in a list of scored documents.
<p>During query term preprocessing the query terms are normally analyzed
by the same analyzer that was used to build the index. At this point weights
for terms and phrases are established.
<p>During query search the indexes and the weights are used to assign a
score to each document.
<h3>
Term based scoring</h3>
The scoring part of Lucene is called the Similarity interface, because
it determines how similar a document is to a query. It is also referred
to as the 'custom scoring API'.
<p>Lucene has a default implementation of this Similarity interface. The
scoring can be changed by using another implementation, which requires a
straightforward program adaptation. This document indicates
which Java methods of the Similarity interface can be changed. Each of
these Java methods represents a part of the scoring mechanism.
<p>For scoring the following aspects of the Lucene query language are important:
fields, terms, phrases, and query weights.
<p>The default scoring method is based on a well-established scoring mechanism
for simple terms. It uses the logarithmic inverse document frequency of
the term.
<p>(Check: give a reference?).
<p>Other query elements, such as truncated and imprecise terms and phrases,
are scored by working back to this well-established term scoring
mechanism.
<h3>
Scoring of truncation and imprecision</h3>
The Lucene query language allows truncated and imprecise queries (the ~ operator
for terms and phrases); their influence on scoring is currently not completely
known to the author.
<h2>
Field weighting during indexing</h2>
For each document, field length and name dependent weights must be set at
document indexing time. The default field weight is the inverse square
root of the number of terms the field contains for the document.
<p>This field weighting can be used to give the fields of some documents
an advantage over other documents, e.g. as a result of citation analysis.
<p>Field weighting can also be used to provide a minimum length for short
fields (e.g. titles) in order to prevent these from scoring high only because
of their short length.
<p>Another use is to provide a higher weight for fields with a priori known
higher relevance.
<p>Since field weights can also be adapted in queries, field
weighting during indexing is most useful to distinguish documents from
each other.
<p>Field weights in the index have about one decimal digit (3 bits) of precision;
they are stored as a single byte for each field of each document.
<p>Changes in the document field weights require that all documents are
reindexed.
<p>Java method: lengthNorm(fieldName, numberOfTerms)
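<p>As an illustration, the default field weight described above can be sketched in plain Java. This is a sketch, not Lucene's source; the class name is invented, and the default is assumed to ignore the field name:

```java
public class LengthNormSketch {
    // Default field weight described above: the inverse square root of
    // the number of terms the field contains for the document.
    // The field name parameter is present to mirror the method signature,
    // but this sketch of the default does not use it.
    public static float lengthNorm(String fieldName, int numberOfTerms) {
        return (float) (1.0 / Math.sqrt(numberOfTerms));
    }
}
```

For example, a field with 4 terms gets weight 0.5, while a field with 100 terms gets weight 0.1, so longer fields contribute less per occurrence.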
<h2>
Query term preprocessing</h2>
Initial query processing retrieves the document frequencies of the terms
in the query. This is combined with the query weights to form the term
and phrase weights to be used during query search.
<h3>
Query term weight and query phrase weight</h3>
Since this is part of the query language, there is no corresponding Java
method in the Lucene Similarity API.
<p>The term or phrase weight given in the query (default 1) is the first term
weighting factor. It can be used, among other things, to compensate for unwanted
effects of the other term weighting described below.
<p>The Lucene query language makes it possible to require the presence of a query
term or phrase in a document field. Higher term or phrase frequencies within
scored documents can be obtained, e.g., by using a higher query weight.
<h3>
Inverse document frequency of a term</h3>
Another weight for a term within a query can depend on its document frequency
(the number of documents in which the term occurs) and the total number
of documents taking part in the search. The default for this weight is
(1 + log(numDocs/(docFreq + 1))), i.e. a
term score is lower when more documents contain the term.
<p>Check: document frequencies of truncated query terms and imprecise query
terms.
<p>Java method: idf(docFreq, numDocs); idf stands for 'inverse document
frequency'.
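<p>The default formula above can be sketched in plain Java (class name invented; the logarithm is assumed to be natural):

```java
public class IdfSketch {
    // Default inverse document frequency described above:
    // 1 + log(numDocs / (docFreq + 1)), using the natural logarithm.
    // A rarer term (smaller docFreq) gets a higher weight.
    public static float idf(int docFreq, int numDocs) {
        return (float) (1.0 + Math.log((double) numDocs / (docFreq + 1)));
    }
}
```

For example, with 100 documents, a term occurring in 9 of them gets idf 1 + ln(10) ≈ 3.30, while a term occurring in 99 of them gets idf 1 + ln(1) = 1.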
<h3>
Inverse document frequency of a phrase</h3>

<p>For a phrase the document frequency is not available before the
query is evaluated. By default, the inverse document frequencies of the
individual terms in a phrase are summed to provide a phrase weighting factor.
<p>Java method: idf(terms, searcher)
<br>The searcher here is the java object that executes the query.
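<p>A sketch of this default summation in plain Java (class and method names invented; the per-term idf is the default from the previous section):

```java
public class PhraseIdfSketch {
    // Per-term default idf, as in the previous section.
    static float idf(int docFreq, int numDocs) {
        return (float) (1.0 + Math.log((double) numDocs / (docFreq + 1)));
    }

    // Default phrase weight described above: the sum of the inverse
    // document frequencies of the individual terms in the phrase.
    public static float phraseIdf(int[] termDocFreqs, int numDocs) {
        float sum = 0.0f;
        for (int docFreq : termDocFreqs) {
            sum += idf(docFreq, numDocs);
        }
        return sum;
    }
}
```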
<h3>
Query norm</h3>
To make scores from different queries comparable, a query norm function
is used, which is provided with the sum of the squared weights of all the
query terms. This function does not affect the ranking order for a single
query.
<br>By default this function is the inverse of the square root.
<p>Java method: queryNorm(sumOfSquaredWeights).
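<p>The default query norm can be sketched in plain Java (class name invented):

```java
public class QueryNormSketch {
    // Default query norm described above: the inverse square root of the
    // sum of the squared weights of all query terms and phrases.
    // Scaling every term weight by the same factor leaves ranking intact.
    public static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }
}
```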
<br>&nbsp;
<h2>
Query search</h2>
For each document that satisfies the query, the search extracts the following
information from the indexes. Here 'field' is used in the
sense of a field of the document being searched.
<ol>
<li>
field weight of the field in which a query term or phrase occurs,</li>

<li>
query term frequency within the field,</li>

<li>
query phrase frequency within the field,</li>

<li>
edit distance for imprecise query terms and query phrases within the
field,</li>

<li>
the number of different query terms within the document.</li>
</ol>

<h3>
Term or phrase frequency in a document</h3>
The frequency of a term or phrase within a document field is available
for scoring. By default, the square root of this frequency is used.
<p>Java method: tf( frequency)
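<p>A sketch of this default in plain Java (class name invented):

```java
public class TfSketch {
    // Default term frequency factor described above: the square root of
    // the number of times the term or phrase occurs in the field.
    // The square root damps the effect of many repetitions.
    public static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }
}
```

So a term occurring 9 times in a field contributes only 3 times as much as a single occurrence, not 9 times as much.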
<h3>
Imprecise occurrences</h3>
For imprecise phrase matches, the 'edit distance' to a phrase is
also available for scoring. The edit distance is a measure of how imprecise
the match is.
<p>It is used to compute the contribution of the match to the total frequency
of the phrase in the document field. By default this 'sloppy
frequency contribution' is 1/(distance + 1).
<p>The precise meaning of 'edit distance' needs further investigation.
<p>Check:
<br>For phrases the edit distance is computed using term proximity information
from the index.
<br>For terms the edit distance is the minimum number of single character
edits (modify, insert, delete) between the query term and the occurring
term.
<p>Java method: sloppyFreq( distance)
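<p>A sketch of this default in plain Java (class name invented):

```java
public class SloppyFreqSketch {
    // Default 'sloppy frequency contribution' described above:
    // 1 / (distance + 1). An exact match (distance 0) contributes a full
    // occurrence; sloppier matches contribute progressively less.
    public static float sloppyFreq(int distance) {
        return 1.0f / (distance + 1);
    }
}
```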
<h3>
Query document overlap</h3>
The number of different query terms that a document contains (i.e. the overlap)
and the number of terms in the query are used for another factor indicating
how well the document matches the query as a whole. This makes it possible
to take into account the number of different non-required terms occurring in the
document.
<p>Check: As truncated query terms are equivalent to the OR of all matching
terms in the index, truncation can result in a large maxOverlap.
<p>By default this factor is (overlap / nrQueryTerms). (The API documentation
uses maxOverlap for nrQueryTerms, to be investigated).
<p>Java method: coord(overlap, maxOverlap)
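<p>A sketch of this default in plain Java (class name invented):

```java
public class CoordSketch {
    // Default overlap factor described above: overlap / nrQueryTerms,
    // i.e. the fraction of the query's terms that the document contains.
    public static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }
}
```

For a four-term query, a document matching two of the terms gets half the score it would get from the occurrence scores alone.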
<h2>
Scoring formulas</h2>
The following formulas determine how the document score for a query is
computed.
<br>&nbsp;
<h3>
Query preprocessing</h3>
&nbsp;
<table BORDER COLS=3 WIDTH="100%" >
<tr>
<td>numDocs</td>

<td>&nbsp;</td>

<td>The number of documents in the database, from the index</td>
</tr>

<tr>
<td>docFreq</td>

<td>&nbsp;</td>

<td>The number of documents in which a query term occurs, from the index</td>
</tr>

<tr>
<td>qtw</td>

<td>&nbsp;</td>

<td>The query weight of a term or phrase, from the query</td>
</tr>

<tr>
<td>tw</td>

<td>qtw * idf(docFreq, numDocs)</td>

<td>Weight of a term in the query</td>
</tr>

<tr>
<td>tw</td>

<td>qtw * idf(terms, searcher)</td>

<td>Weight of a phrase in the query</td>
</tr>

<tr>
<td>qn</td>

<td>queryNorm( SUM(tw * tw))</td>

<td>The query norm, summing over all terms and phrases in the query</td>
</tr>
</table>
&nbsp;
<br>&nbsp;
<h3>
Query search</h3>
During search the actual occurrences of terms and phrases in the document
are taken into account. Here 'field' is used in the sense of a field of
the document being scored. 'Occurrence' is used in the sense of occurrence
in a field.
<br>&nbsp;
<br>&nbsp;
<table BORDER COLS=3 WIDTH="100%" >
<tr>
<td>freq</td>

<td>&nbsp;</td>

<td>Number of times a term occurs in a field, from the index.</td>
</tr>

<tr>
<td>distance</td>

<td>&nbsp;</td>

<td>See 'Imprecise occurrences'.</td>
</tr>

<tr>
<td>overlap</td>

<td>&nbsp;</td>

<td>See 'Query document overlap'.</td>
</tr>

<tr>
<td>maxOverlap</td>

<td>&nbsp;</td>

<td>See 'Query document overlap'.</td>
</tr>

<tr>
<td>fw</td>

<td>lengthNorm(fieldName, numberOfTerms)</td>

<td>Field weight of the field of an occurrence</td>
</tr>

<tr>
<td>tfs&nbsp;</td>

<td>tw * fw * tf( freq)</td>

<td>&nbsp;Score of a term in a field</td>
</tr>

<tr>
<td>ssf</td>

<td>SUM(sloppyFreq( distance))</td>

<td>'Frequency' of imprecise occurrences</td>
</tr>

<tr>
<td>tfs</td>

<td>tw * fw * tf(ssf)</td>

<td>&nbsp;Score of imprecise occurrences</td>
</tr>

<tr>
<td>tds</td>

<td>SUM(tfs)</td>

<td>Total score of occurrences</td>
</tr>

<tr>
<td>crd</td>

<td>coord(overlap, maxOverlap)</td>

<td>Query document overlap</td>
</tr>

<tr>
<td>docscore</td>

<td>qn * tds * crd&nbsp;</td>

<td>Document score for the query</td>
</tr>
</table>
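<p>Putting the two tables together, a worked single-term example can be sketched in plain Java. This combines the default formulas described earlier; the class, method, and variable names are invented for illustration and the example is limited to one term in one field:

```java
public class DocScoreSketch {
    static float idf(int docFreq, int numDocs) {
        return (float) (1.0 + Math.log((double) numDocs / (docFreq + 1)));
    }
    static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }
    static float lengthNorm(int numberOfTerms) {
        return (float) (1.0 / Math.sqrt(numberOfTerms));
    }
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    // Score of a one-term query against one document, following the
    // rows of the two tables above.
    public static float score(int docFreq, int numDocs, float qtw,
                              int fieldLength, int freq,
                              int overlap, int maxOverlap) {
        float tw  = qtw * idf(docFreq, numDocs); // weight of the term in the query
        float qn  = queryNorm(tw * tw);          // query norm over one term
        float fw  = lengthNorm(fieldLength);     // field weight of the occurrence
        float tfs = tw * fw * tf(freq);          // score of the term in the field
        float tds = tfs;                         // SUM(tfs) over the single term
        float crd = coord(overlap, maxOverlap);  // query/document overlap
        return qn * tds * crd;                   // docscore
    }
}
```

For a one-term query (so qn cancels tw and crd is 1), a field of 16 terms containing the term 4 times yields fw * tf = 0.25 * 2 = 0.5.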
&nbsp;
</body>
</html>


-- 

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

