lucene-dev mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: User documentation for scoring
Date Wed, 16 Apr 2003 05:51:17 GMT
Oh, one more thing.  You wouldn't happen to have this document in
'xdocs format' as well?  If you do, please send that version.  If not,
I'll either try converting it or I'll just stick the HTML version in
CVS.

Otis


--- Ype Kingma <ykingma@xs4all.nl> wrote:
> Terry, Otis,
> 
> >Ype,
> >
> >I couldn't find/open any attachment.  Would you try to send it to me
> >directly?  I'd very much like to read and help revise the document.
> 
> O well, I changed email program and it _said_ it would attach.
> Anyway, here is the html, it's not very clean, but it displays
> nicely here.
> 
> Sorry for the teaser,
> Ype
> 
> 
> 
> <!doctype html public "-//w3c//dtd html 4.0 transitional//en">
> <html>
> <head>
>    <meta http-equiv="Content-Type" content="text/html;
> charset=iso-8859-1">
>    <meta name="GENERATOR" content="Mozilla/4.79 (Macintosh; U; PPC)
> [Netscape]">
>    <title>lcnscoring.html</title>
> </head>
> <body>
> 
> <h1>
> Scoring in Lucene</h1>
> 
> <h2>
> &nbsp;Introduction</h2>
> The scoring capabilities of&nbsp; the Lucene search engine are
> explored.
> The target audience is primarily the users of Lucene. The intention
> is
> to help these users understand the default document scores computed
> by
> Lucene for their queries, and how the default scoring mechanism can
> be
> adapted.
> <p>The Lucene scoring mechanism in Lucene 1.3dev is used in this
> document. This interface is available as the Java class Similarity in
> package org.apache.lucene.search from the Lucene web site
> http://jakarta.apache.org/lucene, revision 1.2, 29 Jan 2003.
> <h2>
> Basic operation of Lucene</h2>
> In Lucene, a document consists of fields, and a field consists of
> terms.
> To execute queries Lucene analyses and indexes documents.
> <h3>
> Indexing and querying</h3>
> During indexing, an analyzer software component extracts terms from
> each document. The analyzer normally splits the documents into fields
> and words and removes stop words. For each field these words are then
> stored in a Lucene index as terms.
> <p>Once the index is ready it can be used for querying. A query
> consists
> of terms and phrases and results in a list of scored documents.
> <p>During query term preprocessing the query terms are normally
> analysed
> by the same analyzer that was used to build the index. At this point
> weights
> for terms and phrases are established.
> <p>During query search the indexes and the weights are used to assign
> a
> score to each document.
> <h3>
> Term based scoring</h3>
> The scoring part of Lucene is called the Similarity interface,
> because
> it determines how similar a document is to a query. It is also
> referred
> to as the 'custom scoring API'.
> <p>Lucene has a default implementation of this Similarity interface.
> By
> using another implementation the scoring can be changed. Such a
> change
> requires a straightforward program adaptation. This document
> indicates
> which Java methods of the Similarity interface can be changed. Each
> of
> these java methods represents a part of the scoring mechanism.
> <p>For scoring the following aspects of the Lucene query language are
> important:
> fields, terms, phrases, and query weights.
> <p>The default scoring method is based on a well-established scoring
> mechanism for simple terms. It uses the logarithmic inverse document
> frequency for the term.
> <p>(Check: give a reference?).
> <p>Other query elements are truncated and imprecise terms and
> phrases.
> Their scoring is done by working back to this well established term
> scoring
> mechanism.
> <h3>
> Scoring of truncation and imprecision</h3>
> The Lucene query language allows truncation and imprecise queries (~
> operator for terms and phrases). Their influence on scoring is
> currently not completely known to the author.
> <h2>
> &nbsp;Field weighting during indexing</h2>
> For each document, field length and name dependencies must be set at
> document indexing time. The default field weight is the inverse
> square root of the number of terms the field contains for the document.
> <p>This field weighting can be used to give the fields of some
> documents an advantage over other documents, e.g. as a result of
> citation analysis.
> <p>Field weighting can also be used to provide a minimum length for
> short fields (e.g. titles) in order to prevent these from scoring high
> only because of their short length.
> <p>Another use is to provide a higher weight for fields with a priori
> known higher relevance.
> <p>Since field weights can also be adapted in queries, field weighting
> during indexing is most useful to distinguish documents from each
> other.
> <p>Field weights in the index have about 1 decimal digit (3 bits) of
> precision; they are stored as a single byte for each field of each
> document.
> <p>Changes in the document field weights require that all documents
> are
> reindexed.
> <p>Java method:&nbsp; lengthNorm(fieldName, numberOfTerms)
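> <p>As a rough illustration of the default field weight described above,
> the following stand-alone sketch computes 1/sqrt(numberOfTerms). It is
> not the Lucene source: the class name LengthNormSketch and the method
> lengthNormWithMinimum are hypothetical and only show how a minimum
> length for short fields could be applied.

```java
// Hypothetical stand-alone sketch; it does not use or reproduce the Lucene classes.
class LengthNormSketch {

    /** Default field weight from the text: 1 / sqrt(numberOfTerms). The field name is ignored here. */
    static float lengthNorm(String fieldName, int numberOfTerms) {
        return (float) (1.0 / Math.sqrt(numberOfTerms));
    }

    /** Assumed variant: fields shorter than minTerms are weighted as if they had minTerms terms. */
    static float lengthNormWithMinimum(String fieldName, int numberOfTerms, int minTerms) {
        return lengthNorm(fieldName, Math.max(numberOfTerms, minTerms));
    }

    public static void main(String[] args) {
        System.out.println(lengthNorm("title", 4));                 // short field, high weight
        System.out.println(lengthNorm("body", 100));                // long field, low weight
        System.out.println(lengthNormWithMinimum("title", 4, 16));  // short field, capped weight
    }
}
```

> <p>With the minimum applied, a 4 term title is weighted like a 16 term
> field, so it no longer scores high only because it is short.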
> <h2>
> &nbsp;Query term preprocessing</h2>
> Initial query processing retrieves the document frequencies of the
> terms
> in the query. This is combined with the query weights to form the
> term
> and phrase weights to be used during query search.
> <h3>
> Query term weight and query phrase weight</h3>
> Since this is part of the query language there is no corresponding
> java
> method in the Lucene Similarity API.
> <p>The term or phrase weight given in the query (default 1) is the
> first term weighting factor. It can be used, among other things, to
> compensate for unwanted effects of other term weighting as described
> below.
> <p>The Lucene query language allows requiring the presence of a query
> term or phrase in a document field. Higher term or phrase frequencies
> within scored documents can, e.g., be obtained by using a higher query
> weight.
> <h3>
> Inverse document frequency of a term</h3>
> Another weight for a term within a query can depend on its document
> frequency (the number of documents in which the term occurs) and the
> total number of documents taking part in the search. The default for
> this weight is (1 + log(numDocs/(docFreq + 1))), i.e. a term score is
> lower when more documents contain the term.
> <p>Check: document frequencies of truncated query terms and imprecise
> query
> terms.
> <p>Java method: idf(docFreq, numDocs), idf stands for 'inverse
> document
> frequency'.
> <h3>
> Inverse document frequency of a phrase</h3>
> 
> <p>For a phrase the document frequency is not available before the
> query is evaluated. By default, the inverse document frequencies of
> the individual terms in a phrase are summed to provide a phrase
> weighting factor.
> <p>Java method: idf(terms, searcher)
> <br>The searcher here is the java object that executes the query.
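> <p>The term and phrase weights from the two sections above can be
> sketched as follows. This is a simplified illustration, not the Lucene
> API: the natural logarithm is an assumption, and the hypothetical
> method phraseIdf takes the document frequencies directly instead of
> looking them up through a searcher.

```java
// Hypothetical stand-alone sketch of the default idf weights described in the text.
class IdfSketch {

    /** Term weight: 1 + log(numDocs / (docFreq + 1)); the natural logarithm is assumed. */
    static float idf(int docFreq, int numDocs) {
        return (float) (1.0 + Math.log(numDocs / (double) (docFreq + 1)));
    }

    /** Phrase weight: the sum of the idf values of the individual terms in the phrase. */
    static float phraseIdf(int[] docFreqs, int numDocs) {
        float sum = 0f;
        for (int docFreq : docFreqs) {
            sum += idf(docFreq, numDocs);
        }
        return sum;
    }

    public static void main(String[] args) {
        int numDocs = 1000;
        System.out.println(idf(9, numDocs));    // rare term, higher weight
        System.out.println(idf(499, numDocs));  // common term, lower weight
        System.out.println(phraseIdf(new int[] {9, 499}, numDocs));
    }
}
```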
> <h3>
> Query norm</h3>
> To make scores from different queries comparable, a query norm
> function is used, which is provided with the sum of the squared
> weights of all the query terms. This function does not affect the
> ranking order for a single query.
> <br>By default this function is the inverse of the square root.
> <p>Java method: queryNorm(sumOfSquaredWeights).
> <br>&nbsp;
> <h2>
> &nbsp;Query search</h2>
> For each document that satisfies the query, the search extracts the
> following information from the indexes. Here 'field' is used in the
> sense of a field of the document being searched.
> <ol>
> <li>
> field weight of the field in which a query term or phrase
> occurs,</li>
> 
> <li>
> query term frequency within the field,</li>
> 
> <li>
> query phrase frequency within the field,</li>
> 
> <li>
> edit distance for imprecise query terms and query phrases within
> the field,</li>
> 
> <li>
> the number of different query terms within the document.</li>
> </ol>
> 
> <h3>
> Term or phrase frequency in a document</h3>
> The frequency of a term or phrase within a document field is
> available
> for scoring. By default, the square root of this frequency is used.
> <p>Java method: tf( frequency)
> <h3>
> Imprecise occurrences</h3>
> For imprecise phrase matches, the 'edit distance' to a phrase is also
> available for scoring. The edit distance is a measure of how imprecise
> the match is.
> <p>It is used to compute the contribution of the match to the total
> frequency of the phrase in the document field. By default this 'sloppy
> frequency contribution' is 1/(distance + 1).
> <p>The precise meaning of 'edit distance' needs further
> investigation.
> <p>Check:
> <br>For phrases the edit distance is computed using term proximity
> information
> from the index.
> <br>For terms the edit distance is the minimum number of single
> character
> edits (modify, insert, delete) between the query term and the
> occurring
> term.
> <p>Java method: sloppyFreq( distance)
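> <p>A small sketch may help to see how the frequency factor and the
> sloppy frequency contribution combine. It only restates the defaults
> given above (square root of the frequency, 1/(distance + 1) per
> imprecise match). The class name and the helper summedSloppyFreq are
> hypothetical, and the distances are made-up inputs.

```java
// Hypothetical stand-alone sketch of the default tf and sloppyFreq factors described in the text.
class SloppyFreqSketch {

    /** Frequency factor: the square root of the (possibly fractional) frequency. */
    static float tf(float frequency) {
        return (float) Math.sqrt(frequency);
    }

    /** Contribution of one imprecise match: 1 / (distance + 1). */
    static float sloppyFreq(int distance) {
        return 1.0f / (distance + 1);
    }

    /** Total 'frequency' of a phrase in a field: the sum of the contributions of all its matches. */
    static float summedSloppyFreq(int[] distances) {
        float sum = 0f;
        for (int distance : distances) {
            sum += sloppyFreq(distance);
        }
        return sum;
    }

    public static void main(String[] args) {
        // One exact match (distance 0) and two imprecise matches (distances 1 and 3).
        float ssf = summedSloppyFreq(new int[] {0, 1, 3});
        System.out.println(ssf);      // 1 + 1/2 + 1/4
        System.out.println(tf(ssf));  // factor that enters the score
    }
}
```

> <p>An exact match counts as one full occurrence, and each step of
> distance reduces the contribution of a match.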
> <h3>
> Query document overlap</h3>
> The number of different query terms that a document contains (i.e. the
> overlap) and the number of terms in the query are used for another
> factor indicating how well the document matches the query as a whole.
> This makes it possible to take into account the number of different
> non-required terms occurring in the document.
> <p>Check: As truncated query terms are equivalent to the OR of all
> matching
> terms in the index, truncation can result in a large maxOverlap.
> <p>By default this factor is (overlap / nrQueryTerms). (The API
> documentation
> uses maxOverlap for nrQueryTerms, to be investigated).
> <p>Java method: coord(overlap, maxOverlap)
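> <p>The default overlap factor can be sketched in a few lines. Here
> maxOverlap is taken to be the number of terms in the query, as the text
> above assumes. The class name is hypothetical and the code is
> independent of the Lucene classes.

```java
// Hypothetical stand-alone sketch of the default overlap factor described in the text.
class CoordSketch {

    /** Fraction of the query terms that the document contains: overlap / maxOverlap. */
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    public static void main(String[] args) {
        System.out.println(coord(3, 3));  // document contains all three query terms
        System.out.println(coord(2, 3));  // document contains two of the three query terms
    }
}
```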
> <h2>
> Scoring formulas</h2>
> The following formulas determine how the document score for a query
> is
> computed.
> <br>&nbsp;
> <h3>
> Query preprocessing</h3>
> &nbsp;
> <table BORDER COLS=3 WIDTH="100%" >
> <tr>
> <td>numDocs</td>
> 
> <td>&nbsp;</td>
> 
> <td>The number of documents in the database, from the index</td>
> </tr>
> 
> <tr>
> <td>docFreq</td>
> 
> <td>&nbsp;</td>
> 
> <td>The number of documents in which a query term occurs, from the
> index</td>
> </tr>
> 
> <tr>
> <td>qtw</td>
> 
> <td>&nbsp;</td>
> 
> <td>The query weight of a term or phrase, from the query</td>
> </tr>
> 
> <tr>
> <td>tw</td>
> 
> <td>qtw * idf(docFreq, numDocs)</td>
> 
> <td>Weight of a term in the query</td>
> </tr>
> 
> <tr>
> <td>tw</td>
> 
> <td>qtw * idf(terms, searcher)</td>
> 
> <td>Weight of a phrase in the query</td>
> </tr>
> 
> <tr>
> <td>qn</td>
> 
> <td>queryNorm( SUM(tw * tw))</td>
> 
> <td>The query norm, summing over all terms and phrases in the
> query</td>
> </tr>
> </table>
> &nbsp;
> <br>&nbsp;
> <h3>
> Query search</h3>
> During search the actual occurrences of terms and phrases in the
> document are taken into account. Here 'field' is used in the sense of
> a field of the document being scored. 'Occurrence' is used in the
> sense of an occurrence in a field.
> <br>&nbsp;
> <br>&nbsp;
> <table BORDER COLS=3 WIDTH="100%" >
> <tr>
> <td>freq</td>
> 
> <td>&nbsp;</td>
> 
> <td>Number of times a term occurs in a field, from the index.</td>
> </tr>
> 
> <tr>
> <td>distance</td>
> 
> <td>&nbsp;</td>
> 
> <td>&nbsp;See 'Imprecise occurrences'.</td>
> </tr>
> 
> <tr>
> <td>overlap</td>
> 
> <td>&nbsp;</td>
> 
> <td>&nbsp;See 'Query document overlap'.</td>
> </tr>
> 
> <tr>
> <td>maxOverlap</td>
> 
> <td>&nbsp;</td>
> 
> <td>&nbsp;See 'Query document overlap'.</td>
> </tr>
> 
> <tr>
> <td>fw</td>
> 
> <td>lengthNorm(fieldName, numberOfTerms)</td>
> 
> <td>Field weight of the field of an occurrence</td>
> </tr>
> 
> <tr>
> <td>tfs&nbsp;</td>
> 
> <td>tw * fw * tf( freq)</td>
> 
> <td>&nbsp;Score of a term in a field</td>
> </tr>
> 
> <tr>
> <td>ssf</td>
> 
> <td>SUM(sloppyFreq( distance))</td>
> 
> <td>'Frequency' of imprecise occurrences</td>
> </tr>
> 
> <tr>
> <td>tfs</td>
> 
> <td>tw * fw * tf(ssf)</td>
> 
> <td>&nbsp;Score of imprecise occurrences</td>
> </tr>
> 
> <tr>
> <td>tds</td>
> 
> <td>SUM(tfs)</td>
> 
> <td>Total score of occurrences</td>
> </tr>
> 
> <tr>
> <td>crd</td>
> 
> <td>coord(overlap, maxOverlap)</td>
> 
> <td>Query document overlap</td>
> </tr>
> 
> <tr>
> <td>docscore</td>
> 
> <td>qn * tds * crd&nbsp;</td>
> 
> <td>Document score for the query</td>
> </tr>
> </table>
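> <p>To see how the rows of the two tables fit together, the following
> sketch evaluates docscore for a query of simple terms against a single
> field. It follows the tables literally and is not taken from the Lucene
> source, so actual Lucene scores may differ. The class DocScoreSketch,
> the method docScore and its array parameters are hypothetical, and the
> natural logarithm in idf is an assumption.

```java
// Hypothetical stand-alone sketch that follows the formula tables above for simple terms in one field.
class DocScoreSketch {

    static float idf(int docFreq, int numDocs) {
        return (float) (1.0 + Math.log(numDocs / (double) (docFreq + 1)));
    }

    static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    static float lengthNorm(int numberOfTerms) {
        return (float) (1.0 / Math.sqrt(numberOfTerms));
    }

    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    /**
     * qtw[i] and docFreq[i] describe query term i; freq[i] is the number of times
     * that term occurs in the scored field (0 if it does not occur).
     */
    static float docScore(float[] qtw, int[] docFreq, int[] freq, int numDocs, int numberOfTerms) {
        int n = qtw.length;
        float[] tw = new float[n];
        float sumOfSquaredWeights = 0f;
        for (int i = 0; i < n; i++) {
            tw[i] = qtw[i] * idf(docFreq[i], numDocs);  // tw = qtw * idf(docFreq, numDocs)
            sumOfSquaredWeights += tw[i] * tw[i];
        }
        float qn = queryNorm(sumOfSquaredWeights);      // qn = queryNorm(SUM(tw * tw))
        float fw = lengthNorm(numberOfTerms);           // fw = lengthNorm(fieldName, numberOfTerms)

        float tds = 0f;
        int overlap = 0;
        for (int i = 0; i < n; i++) {
            if (freq[i] > 0) {
                tds += tw[i] * fw * tf(freq[i]);        // tfs = tw * fw * tf(freq), tds = SUM(tfs)
                overlap++;
            }
        }
        float crd = coord(overlap, n);                  // crd = coord(overlap, maxOverlap)
        return qn * tds * crd;                          // docscore = qn * tds * crd
    }

    public static void main(String[] args) {
        float[] qtw = {1f, 1f};
        int[] docFreq = {9, 99};
        System.out.println(docScore(qtw, docFreq, new int[] {1, 1}, 1000, 100));  // both terms occur
        System.out.println(docScore(qtw, docFreq, new int[] {1, 0}, 1000, 100));  // only the first occurs
    }
}
```

> <p>In this sketch a document that contains both query terms scores
> higher than one that contains only the first, because both tds and crd
> are larger.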
> &nbsp;
> </body>
> </html>
> 
> 
> -- 

