lucene-dev mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: User documentation for scoring
Date Wed, 16 Apr 2003 05:49:40 GMT
Thanks Ype, this looks useful.  I only skimmed it for now.  I will need
to clean up the HTML a bit before adding it to CVS.
There are also a few 'Check:' places that look like placeholders for
you to add more information.  Shall I just remove those or would you
like to add something to the document?  This can be done even after I
clean it up and add it to CVS.

Thanks again,
Otis

--- Ype Kingma <ykingma@xs4all.nl> wrote:
> Terry, Otis,
> 
> >Ype,
> >
> >I couldn't find/open any attachment.  Would you try to send it to me
> >directly?  I'd very much like to read and help revise the document.
> 
> O well, I changed email program and it _said_ it would attach.
> Anyway, here is the html, it's not very clean, but it displays
> nicely here.
> 
> Sorry for the teaser,
> Ype
> 
> 
> 
> <!doctype html public "-//w3c//dtd html 4.0 transitional//en">
> <html>
> <head>
>    <meta http-equiv="Content-Type" content="text/html;
> charset=iso-8859-1">
>    <meta name="GENERATOR" content="Mozilla/4.79 (Macintosh; U; PPC)
> [Netscape]">
>    <title>lcnscoring.html</title>
> </head>
> <body>
> 
> <h1>
> Scoring in Lucene</h1>
> 
> <h2>
> &nbsp;Introduction</h2>
> This document explores the scoring capabilities of the Lucene search
> engine. The target audience is primarily the users of Lucene. The
> intention is to help these users understand the default document
> scores computed by Lucene for their queries, and how the default
> scoring mechanism can be adapted.
> <p>This document describes the scoring mechanism of Lucene 1.3dev.
> Its interface is available as the Java class Similarity in package
> org.apache.lucene.search from the Lucene web site
> http://jakarta.apache.org/lucene, revision 1.2, 29 Jan 2003.
> <h2>
> Basic operation of Lucene</h2>
> In Lucene, a document consists of fields, and a field consists of
> terms.
> To execute queries Lucene analyses and indexes documents.
> <h3>
> Indexing and querying</h3>
> During indexing, an analyzer software component extracts terms from
> each document. The analyzer normally splits the documents into fields
> and words and removes stop words. For each field these words are then
> stored in a Lucene index as terms.
> <p>Once the index is ready it can be used for querying. A query
> consists
> of terms and phrases and results in a list of scored documents.
> <p>During query term preprocessing the query terms are normally
> analysed
> by the same analyzer that was used to build the index. At this point
> weights
> for terms and phrases are established.
> <p>During query search the indexes and the weights are used to assign
> a
> score to each document.
> <h3>
> Term based scoring</h3>
> The scoring part of Lucene is called the Similarity interface,
> because
> it determines how similar a document is to a query. It is also
> referred
> to as the 'custom scoring API'.
> <p>Lucene has a default implementation of this Similarity interface.
> By
> using another implementation the scoring can be changed. Such a
> change
> requires a straightforward program adaptation. This document
> indicates
> which Java methods of the Similarity interface can be changed. Each
> of these Java methods represents a part of the scoring mechanism.
> <p>For scoring the following aspects of the Lucene query language are
> important:
> fields, terms, phrases, and query weights.
> <p>The default scoring method is based on a well-established scoring
> mechanism for simple terms. It uses the logarithmic inverse document
> frequency of the term.
> <p>(Check: give a reference?).
> <p>Other query elements are truncated and imprecise terms and
> phrases. Their scoring is reduced to this well-established term
> scoring mechanism.
> <h3>
> Scoring of truncation and imprecision</h3>
> The Lucene query language allows truncation and imprecise queries
> (the ~ operator for terms and phrases). Their influence on scoring is
> currently not completely known to the author.
> <h2>
> &nbsp;Field weighting during indexing</h2>
> For each document, weights that depend on field length and field name
> must be set at indexing time. The default field weight is the inverse
> square root of the number of terms the field contains in the document.
> <p>This field weighting can be used to give the fields of some
> documents an advantage over other documents, e.g. as a result of
> citation analysis.
> <p>Field weighting can also be used to provide a minimum length for
> short fields (e.g. titles) in order to prevent these from scoring high
> only because of their short length.
> <p>Another use is to provide a higher weight for fields with a priori
> known higher relevance.
> <p>Since field weights can also be adapted in queries, field weighting
> during indexing is most useful to distinguish documents from each
> other.
> <p>Field weights in the index have about 1 decimal digit (3 bits) of
> precision; they are stored as a single byte for each field of each
> document.
> <p>Changes in the document field weights require that all documents
> be reindexed.
> <p>Java method:&nbsp; lengthNorm(fieldName, numberOfTerms)
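For what it's worth, here is a small plain-Java sketch of the default field weight and of the minimum-length idea mentioned above. It does not call Lucene at all. The class name, the MIN_TITLE_TERMS constant and the "title" field name are made up for illustration; only the inverse square root comes from the text.

```java
// Stand-alone sketch of the field weight described above; not Lucene's own code.
final class LengthNormSketch {
    // Hypothetical floor for short fields, chosen only for illustration.
    static final int MIN_TITLE_TERMS = 10;

    // Default as described: inverse square root of the number of terms.
    // Note: numberOfTerms == 0 would give infinity in this sketch.
    static float defaultLengthNorm(String fieldName, int numberOfTerms) {
        return (float) (1.0 / Math.sqrt(numberOfTerms));
    }

    // Variant that treats a very short "title" field as if it had at least
    // MIN_TITLE_TERMS terms, so that brevity alone does not win.
    static float flooredLengthNorm(String fieldName, int numberOfTerms) {
        int n = numberOfTerms;
        if ("title".equals(fieldName)) {
            n = Math.max(n, MIN_TITLE_TERMS);
        }
        return defaultLengthNorm(fieldName, n);
    }

    public static void main(String[] args) {
        System.out.println(defaultLengthNorm("title", 4));
        System.out.println(flooredLengthNorm("title", 4));
    }
}
```

With the floor in place a 4-term title gets the same weight as a 10-term one, while other fields keep the default.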
> <h2>
> &nbsp;Query term preprocessing</h2>
> Initial query processing retrieves the document frequencies of the
> terms
> in the query. This is combined with the query weights to form the
> term
> and phrase weights to be used during query search.
> <h3>
> Query term weight and query phrase weight</h3>
> Since this is part of the query language there is no corresponding
> Java method in the Lucene Similarity API.
> <p>The term or phrase weight given in the query (default 1) is the
> first term weighting factor. It can be used, among other things, to
> compensate for unwanted effects of other term weighting as described
> below.
> <p>The Lucene query language allows a query to require the presence of
> a query term or phrase in a document field. Higher term or phrase
> frequencies within scored documents can e.g. be obtained by using a
> higher query weight.
> <h3>
> Inverse document frequency of a term</h3>
> Another weight for a term within a query can depend on its document
> frequency
> (the number of documents in which the term occurs) and the total
> number
> of documents taking part in the search. The default for this weight
> is (1 + log(numDocs/(docFreq + 1))), i.e. a term score is lower when
> more documents contain the term.
> <p>Check: document frequencies of truncated query terms and imprecise
> query
> terms.
> <p>Java method: idf(docFreq, numDocs), idf stands for 'inverse
> document
> frequency'.
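A quick sketch of the default idf formula quoted above, in plain Java without Lucene. The class name is made up, and I am assuming that "log" means the natural logarithm; the text does not say which base is meant.

```java
// Stand-alone sketch of the idf formula quoted above; not Lucene's own code.
final class IdfSketch {
    // 1 + log(numDocs / (docFreq + 1)); natural logarithm assumed.
    static float idf(int docFreq, int numDocs) {
        return (float) (1.0 + Math.log(numDocs / (double) (docFreq + 1)));
    }

    public static void main(String[] args) {
        int numDocs = 1000;
        // A rare term gets a higher weight than a common one.
        System.out.println("docFreq=1:   " + idf(1, numDocs));
        System.out.println("docFreq=100: " + idf(100, numDocs));
        System.out.println("docFreq=999: " + idf(999, numDocs));
    }
}
```

The +1 in the denominator keeps the division defined when docFreq is 0.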
> <h3>
> Inverse document frequency of a phrase</h3>
> 
> <p>For a phrase the document frequency is not available before the
> query is evaluated. By default, the inverse document frequencies of
> the individual terms in a phrase are summed to provide a phrase
> weighting factor.
> <p>Java method: idf(terms, searcher)
> <br>The searcher here is the Java object that executes the query.
> <h3>
> Query norm</h3>
> To make scores from different queries comparable, a query norm
> function
> is used, which is provided with the sum of the squared weights of all
> the
> query terms. This function does not affect the ranking order for a
> single
> query.
> <br>By default this function is the inverse of the square root of that
> sum.
> <p>Java method: queryNorm(sumOfSquaredWeights).
> <br>&nbsp;
> <h2>
> &nbsp;Query search</h2>
> For each document that satisfies the query, the search extracts the
> following
> information from the indexes. Here 'field' is used in the sense of a
> field of the document being searched.
> <ol>
> <li>
> field weight of the field in which a query term or phrase
> occurs,</li>
> 
> <li>
> query term frequency within the field,</li>
> 
> <li>
> query phrase frequency within the field,</li>
> 
> <li>
> edit distance for imprecise query terms and query phrases within
> the field,</li>
> 
> <li>
> the number of different query terms within the document.</li>
> </ol>
> 
> <h3>
> Term or phrase frequency in a document</h3>
> The frequency of a term or phrase within a document field is
> available
> for scoring. By default, the square root of this frequency is used.
> <p>Java method: tf( frequency)
> <h3>
> Imprecise occurrences</h3>
> For imprecise phrase matches, the 'edit distance' to a phrase is also
> available for scoring. The edit distance is a measure of how imprecise
> the match is.
> <p>It is used to compute the contribution of the match to the total
> frequency of the phrase in the document field. By default this 'sloppy
> frequency contribution' is 1/(distance + 1).
> <p>The precise meaning of 'edit distance' needs further
> investigation.
> <p>Check:
> <br>For phrases the edit distance is computed using term proximity
> information
> from the index.
> <br>For terms the edit distance is the minimum number of single
> character
> edits (modify, insert, delete) between the query term and the
> occurring
> term.
> <p>Java method: sloppyFreq( distance)
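To make the sloppy frequency a bit more concrete, here is a plain-Java sketch that adds up the contributions of several imprecise matches and then applies the default square root from tf(). The class name and the summedSloppyFreq helper are made up; the distances in main are arbitrary example values, since the exact meaning of 'edit distance' is still open above.

```java
// Stand-alone sketch of the sloppy frequency described above; not Lucene's own code.
final class SloppyFreqSketch {
    // Default contribution of one imprecise match: 1 / (distance + 1).
    static float sloppyFreq(int distance) {
        return 1.0f / (distance + 1);
    }

    // Default term or phrase frequency factor: square root of the frequency.
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // Hypothetical helper: total 'frequency' of all imprecise matches in a field.
    static float summedSloppyFreq(int[] distances) {
        float sum = 0.0f;
        for (int d : distances) {
            sum += sloppyFreq(d);
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] distances = {0, 1, 3}; // one exact match and two imprecise ones
        float ssf = summedSloppyFreq(distances);
        System.out.println("ssf = " + ssf + ", tf(ssf) = " + tf(ssf));
    }
}
```

An exact match (distance 0) counts as a full occurrence, and each imprecise match counts for less the larger its distance.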
> <h3>
> Query document overlap</h3>
> The number of different query terms that a document contains (ie. the
> overlap)
> and the number of terms in the query are used for another factor
> indicating
> how well the document matches the query as a whole. This makes it
> possible to take into account the number of different non-required
> terms occurring in the document.
> <p>Check: As truncated query terms are equivalent to the OR of all
> matching
> terms in the index, truncation can result in a large maxOverlap.
> <p>By default this factor is (overlap / nrQueryTerms). (The API
> documentation
> uses maxOverlap for nrQueryTerms, to be investigated).
> <p>Java method: coord(overlap, maxOverlap)
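A tiny plain-Java sketch of the default overlap factor, assuming that maxOverlap is simply the number of terms in the query (which is exactly the point still to be investigated above). The class name is made up.

```java
// Stand-alone sketch of the overlap factor described above; not Lucene's own code.
final class CoordSketch {
    // Fraction of the query terms that the document contains.
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    public static void main(String[] args) {
        int queryTerms = 4;
        // Documents matching more of the query terms get a larger factor.
        for (int overlap = 1; overlap <= queryTerms; overlap++) {
            System.out.println(overlap + " of " + queryTerms + ": " + coord(overlap, queryTerms));
        }
    }
}
```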
> <h2>
> Scoring formulas</h2>
> The following formulas determine how the document score for a query
> is
> computed.
> <br>&nbsp;
> <h3>
> Query preprocessing</h3>
> &nbsp;
> <table BORDER COLS=3 WIDTH="100%" >
> <tr>
> <td>numDocs</td>
> 
> <td>&nbsp;</td>
> 
> <td>The number of documents in the database, from the index</td>
> </tr>
> 
> <tr>
> <td>docFreq</td>
> 
> <td>&nbsp;</td>
> 
> <td>The number of documents in which a query term occurs, from the
> index</td>
> </tr>
> 
> <tr>
> <td>qtw</td>
> 
> <td>&nbsp;</td>
> 
> <td>The query weight of a term or phrase, from the query</td>
> </tr>
> 
> <tr>
> <td>tw</td>
> 
> <td>qtw * idf(docFreq, numDocs)</td>
> 
> <td>Weight of a term in the query</td>
> </tr>
> 
> <tr>
> <td>tw</td>
> 
> <td>qtw * idf(terms, searcher)</td>
> 
> <td>Weight of a phrase in the query</td>
> </tr>
> 
> <tr>
> <td>qn</td>
> 
> <td>queryNorm( SUM(tw * tw))</td>
> 
> <td>The query norm, summing over all terms and phrases in the
> query</td>
> </tr>
> </table>
> &nbsp;
> <br>&nbsp;
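The preprocessing table above can be sketched in plain Java roughly as follows. This is only my reading of the table, not Lucene's own code: the class and helper names are made up, and the idf values are passed in as plain numbers instead of being looked up in an index.

```java
// Stand-alone sketch of the query preprocessing table above; not Lucene's own code.
final class QueryWeightSketch {
    // tw = qtw * idf, for a term or for a phrase.
    static float weight(float qtw, float idf) {
        return qtw * idf;
    }

    // Phrase idf: sum of the idf values of the terms in the phrase.
    static float phraseIdf(float[] termIdfs) {
        float sum = 0.0f;
        for (float idf : termIdfs) {
            sum += idf;
        }
        return sum;
    }

    // SUM(tw * tw) over all terms and phrases in the query.
    static float sumOfSquaredWeights(float[] weights) {
        float sum = 0.0f;
        for (float tw : weights) {
            sum += tw * tw;
        }
        return sum;
    }

    // Default query norm: inverse square root of the summed squared weights.
    static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    public static void main(String[] args) {
        // Example numbers only: one term (qtw 1, idf 3) and one two-term
        // phrase (qtw 2, term idfs 1.5 and 0.5).
        float termWeight = weight(1.0f, 3.0f);
        float phraseWeight = weight(2.0f, phraseIdf(new float[] {1.5f, 0.5f}));
        float qn = queryNorm(sumOfSquaredWeights(new float[] {termWeight, phraseWeight}));
        System.out.println("tw(term)=" + termWeight + " tw(phrase)=" + phraseWeight + " qn=" + qn);
    }
}
```

Since qn is the same for every document matched by a query, it rescales all scores of that query without changing their order.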
> <h3>
> Query search</h3>
> During search the actual occurrences of terms and phrases in the
> document are taken into account. Here 'field' is used in the sense of
> a field of the document being scored. 'Occurrence' is used in the
> sense of an occurrence in a field.
> <br>&nbsp;
> <br>&nbsp;
> <table BORDER COLS=3 WIDTH="100%" >
> <tr>
> <td>freq</td>
> 
> <td>&nbsp;</td>
> 
> <td>Number of times a term occurs in a field, from the index</td>
> </tr>
> 
> <tr>
> <td>distance</td>
> 
> <td>&nbsp;</td>
> 
> <td>&nbsp;See 'Imprecise occurrences'.</td>
> </tr>
> 
> <tr>
> <td>overlap</td>
> 
> <td>&nbsp;</td>
> 
> <td>&nbsp;See 'Query document overlap'.</td>
> </tr>
> 
> <tr>
> <td>maxOverlap</td>
> 
> <td>&nbsp;</td>
> 
> <td>&nbsp;See 'Query document overlap'.</td>
> </tr>
> 
> <tr>
> <td>fw</td>
> 
> <td>lengthNorm(fieldName, numberOfTerms)</td>
> 
> <td>Field weight of the field of an occurrence</td>
> </tr>
> 
> <tr>
> <td>tfs&nbsp;</td>
> 
> <td>tw * fw * tf( freq)</td>
> 
> <td>&nbsp;Score of a term in a field</td>
> </tr>
> 
> <tr>
> <td>ssf</td>
> 
> <td>SUM(sloppyFreq( distance))</td>
> 
> <td>'Frequency' of imprecise occurrences</td>
> </tr>
> 
> <tr>
> <td>tfs</td>
> 
> <td>tw * fw * tf(ssf)</td>
> 
> <td>&nbsp;Score of imprecise occurrences</td>
> </tr>
> 
> <tr>
> <td>tds</td>
> 
> <td>SUM(tfs)</td>
> 
> <td>Total score of occurrences</td>
> </tr>
> 
> <tr>
> <td>crd</td>
> 
> <td>coord(overlap, maxOverlap)</td>
> 
> <td>Query document overlap</td>
> </tr>
> 
> <tr>
> <td>docscore</td>
> 
> <td>qn * tds * crd&nbsp;</td>
> 
> <td>Document score for the query</td>
> </tr>
> </table>
> &nbsp;
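Putting the search table above together, a document score could be computed roughly like this in plain Java. Again this only follows the table as written and is not Lucene's own code: the class and method names are made up, and tw, fw, qn and crd are passed in as ready-made numbers.

```java
// Stand-alone sketch of the query search table above; not Lucene's own code.
final class DocScoreSketch {
    // Default frequency factor: square root of the (possibly sloppy) frequency.
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // tfs = tw * fw * tf(freq): score of one term or phrase in one field.
    static float occurrenceScore(float tw, float fw, float freq) {
        return tw * fw * tf(freq);
    }

    // docscore = qn * SUM(tfs) * crd
    static float docScore(float qn, float[] tfs, float crd) {
        float tds = 0.0f;
        for (float s : tfs) {
            tds += s;
        }
        return qn * tds * crd;
    }

    public static void main(String[] args) {
        // Example numbers only: two query terms found in the document.
        float first = occurrenceScore(2.0f, 0.5f, 4.0f);
        float second = occurrenceScore(1.0f, 1.0f, 1.0f);
        float score = docScore(0.5f, new float[] {first, second}, 0.5f);
        System.out.println("docscore = " + score);
    }
}
```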
> </body>
> </html>
> 
> 
> -- 




