lucene-java-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From gsing...@apache.org
Subject svn commit: r454767 - in /lucene/java/trunk: CHANGES.txt docs/scoring.html src/java/org/apache/lucene/search/Similarity.java xdocs/scoring.xml
Date Tue, 10 Oct 2006 14:57:26 GMT
Author: gsingers
Date: Tue Oct 10 07:57:25 2006
New Revision: 454767

URL: http://svn.apache.org/viewvc?view=rev&rev=454767
Log:
Updated scoring information to only have one copy of the Scoring Formula.  Implemented Doron
Cohen's new scoring formula description in the javadoc.

Modified:
    lucene/java/trunk/CHANGES.txt
    lucene/java/trunk/docs/scoring.html
    lucene/java/trunk/src/java/org/apache/lucene/search/Similarity.java
    lucene/java/trunk/xdocs/scoring.xml

Modified: lucene/java/trunk/CHANGES.txt
URL: http://svn.apache.org/viewvc/lucene/java/trunk/CHANGES.txt?view=diff&rev=454767&r1=454766&r2=454767
==============================================================================
--- lucene/java/trunk/CHANGES.txt (original)
+++ lucene/java/trunk/CHANGES.txt Tue Oct 10 07:57:25 2006
@@ -76,9 +76,9 @@
     SingleInstanceLockFactory (ie, in memory locking) locking with an
     FSDirectory.  Note that now you must call setDisableLocks before
     the instantiation a FSDirectory if you wish to disable locking
-    for that Directory. 
+    for that Directory.
     (Michael McCandless, Jeff Patterson via Yonik Seeley)
- 
+
 Bug fixes
 
  1. Fixed the web application demo (built with "ant war-demo") which
@@ -127,7 +127,7 @@
     has no value.
     (Oliver Hutchison via Chris Hostetter)
 
-    
+
 Optimizations
 
   1. LUCENE-586: TermDocs.skipTo() is now more efficient for multi-segment
@@ -164,7 +164,8 @@
 
   1. Added style sheet to xdocs named lucene.css and included in the Anakia VSL descriptor.
 (Grant Ingersoll)
 
-  2. Added scoring.xml document into xdocs.(Grant Ingersoll and Steve Rowe.  Updates from:
Michael McCandless)
+  2. Added scoring.xml document into xdocs.  Updated Similarity.java scoring formula.(Grant
Ingersoll and Steve Rowe.  Updates from: Michael McCandless, Doron Cohen, Chris Hostetter,
Doug Cutting).  Issue 664.
+
 
 Release 2.0.0 2006-05-26
 

Modified: lucene/java/trunk/docs/scoring.html
URL: http://svn.apache.org/viewvc/lucene/java/trunk/docs/scoring.html?view=diff&rev=454767&r1=454766&r2=454767
==============================================================================
--- lucene/java/trunk/docs/scoring.html (original)
+++ lucene/java/trunk/docs/scoring.html Tue Oct 10 07:57:25 2006
@@ -192,84 +192,73 @@
                                                     <table border="0" cellspacing="0"
cellpadding="2" width="100%">
       <tr><td bgcolor="#828DA6">
         <font color="#ffffff" face="arial,helvetica,sanserif">
+          <a name="Score Boosting"><strong>Score Boosting</strong></a>
+        </font>
+      </td></tr>
+      <tr><td>
+        <blockquote>
+                                    <p>Lucene allows influencing search results by
"boosting" in more than one level:
+                  <ul>
+                    <li><b>Document level boosting</b>
+                    - while indexing - by calling
+                    <a href="api/org/apache/lucene/document/Document.html#setBoost(float)">document.setBoost()</a>
+                    before a document is added to the index.
+                    </li>
+                    <li><b>Document's Field level boosting</b>
+                    - while indexing - by calling
+                    <a href="api/org/apache/lucene/document/Fieldable.html#setBoost(float)">field.setBoost()</a>
+                    before adding a field to the document (and before adding the document
to the index).
+                    </li>
+                    <li><b>Query level boosting</b>
+                     - during search, by setting a boost on a query clause, calling
+                     <a href="api/org/apache/lucene/search/Query.html#setBoost(float)">Query.setBoost()</a>.
+                    </li>
+                  </ul>
+                </p>
+                                                <p>Indexing time boosts are preprocessed
for storage efficiency and written to
+                  the directory (when writing the document) in a single byte (!) as follows:
+                  For each field of a document, all boosts of that field
+                  (i.e. all boosts under the same field name in that doc) are multiplied.
+                  The result is multiplied by the boost of the document,
+                  and also multiplied by a "field length norm" value
+                  that represents the length of that field in that doc
+                  (so shorter fields are automatically boosted up).
+                  The result is decoded as a single byte
+                  (with some precision loss of course) and stored in the directory.
+                  The similarity object in effect at indexing computes the length-norm of
the field.
+                </p>
+                                                <p>This composition of 1-byte representation
of norms
+                (that is, indexing time multiplication of field boosts &amp; doc boost
&amp; field-length-norm)
+                is nicely described in
+                <a href="api/org/apache/lucene/document/Fieldable.html#setBoost(float)">Fieldable.setBoost()</a>.
+                </p>
+                                                <p>Encoding and decoding of the resulted
float norm in a single byte are done by the
+                static methods of the class Similarity:
+                <a href="api/org/apache/lucene/search/Similarity.html#encodeNorm(float)">encodeNorm()</a>
and
+                <a href="api/org/apache/lucene/search/Similarity.html#decodeNorm(byte)">decodeNorm()</a>.
+                Due to loss of precision, it is not guaranteed that decode(encode(x)) = x,
+                e.g. decode(encode(0.89)) = 0.75.
+                At scoring (search) time, this norm is brought into the score of document
+                as <b>indexBoost</b>, as shown by the formula in
+                <a href="api/org/apache/lucene/search/Similarity.html">Similarity</a>.
+                </p>
+                            </blockquote>
+      </td></tr>
+      <tr><td><br/></td></tr>
+    </table>
+                                                    <table border="0" cellspacing="0"
cellpadding="2" width="100%">
+      <tr><td bgcolor="#828DA6">
+        <font color="#ffffff" face="arial,helvetica,sanserif">
           <a name="Understanding the Scoring Formula"><strong>Understanding the
Scoring Formula</strong></a>
         </font>
       </td></tr>
       <tr><td>
         <blockquote>
                                     <p>
-                    Lucene's scoring formula computes the score of one document <i>d</i>
for a given query <i>q</i> across each
-                    term <i>t</i> that occurs in q.  The score attempts to measure
relevance, so the higher the score, the more
-		    relevant document <i>d</i> is to the query <i>q</i>.  This
is taken from
-		    <a href="api/org/apache/lucene/search/Similarity.html">Similarity</a>:
-
-                    <div class="formula">
-                        <!-- Anyone know how to specify sigma in Anakia?  It always seems
to strip out my numeric character references-->
-                        score(q,d) =
-			<span class="big" id="summation">
-                            sum </span><span class="summation-range">t in q</span><span>(
-                        <A HREF="api/org/apache/lucene/search/Similarity.html#tf(int)">tf</A>
-                        (t in d) *
-                        <A HREF="api/org/apache/lucene/search/Similarity.html#idf(org.apache.lucene.index.Term,
org.apache.lucene.search.Searcher)">idf</A>
-                        (t)^2 *
-                        <A HREF="api/org/apache/lucene/search/Query.html#getBoost()">
-                        getBoost
-                        </A>
-                        (t in q) *
-                        getBoost
-                        (t.field in d) *
-                        <A HREF="api/org/apache/lucene/search/Similarity.html#lengthNorm(java.lang.String,
int)">
-                            lengthNorm
-                        </A>
-                        (t.field in d) )</span> <span> *
-                        <A HREF="api/org/apache/lucene/search/Similarity.html#coord(int,
int)">
-                            coord
-                        </A>
-                        (q,d) *
-                        <A HREF="api/org/apache/lucene/search/Similarity.html#queryNorm(float)">
-                            queryNorm
-                        </A>(sumOfSquaredWeights)</span>
-                    </div>
-                </p>
-                                                <p>
-                    where
-                    <!-- Anyone know how to specify sigma in Anakia?  It always seems
to strip out my numeric character references-->
-                    <div id="#sumOfSquares">
-                        sumOfSquaredWeights =
-                        <span class="big">sum</span><span class="summation-range">t
in q</span><span>(
-                        <A HREF="api/org/apache/lucene/search/Similarity.html#idf(org.apache.lucene.index.Term,
org.apache.lucene.search.Searcher)">
-                            idf
-                        </A>
-                        (t) *
-                        <A HREF="api/org/apache/lucene/search/Query.html#getBoost()">
-                            getBoost
-                        </A>
-                        (t in q) )^2</span>
-                    </div>
-                </p>
-                                                <p>
-		This scoring formula is mostly implemented in the
-                    <a href="api/org/apache/lucene/search/TermScorer.html">TermScorer</a>
class, where it makes calls to the
-                    <a href="api/org/apache/lucene/search/Similarity.html">Similarity</a>
class to retrieve values for the following.  Note that the descriptions apply to <a href="api/org/apache/lucene/search/DefaultSimilarity.html">DefaultSimilarity</a>
implementation:
-                    <ol>
-
-                        <li><A HREF="api/org/apache/lucene/search/Similarity.html#tf(int)">tf(t
in d)</A> - Term Frequency - The number of times the term <i>t</i> appears
in the current document <i>d</i> being scored.  Documents that have more occurrences
of a given term receive a higher score.</li>
-
-                        <li><p><A HREF="api/org/apache/lucene/search/Similarity.html#idf(org.apache.lucene.index.Term,
org.apache.lucene.search.Searcher)">idf(t)</A> - Inverse Document Frequency - One
divided by the number of documents in which the term <i>t</i> appears.  This means
rarer terms give higher contribution to the total score.</p></li>
-
-                        <li><p><A HREF="api/org/apache/lucene/search/Query.html#getBoost()">getBoost(t
in q)</A> - The boost, specified in the query by the user, that should be applied to
this term.  A boost over 1.0 will increase the importance of this term; a boost under 1.0
will decrease its importance.  A boost of 1.0 (the default boost) has no effect.</p></li>
-
-                        <li><p><A HREF="api/org/apache/lucene/search/Similarity.html#lengthNorm(java.lang.String,
int)">lengthNorm(t.field in q)</A> - The factor to apply to account for differing
lengths in the fields that are being searched.  Typically longer fields return a smaller value.
 This means matches against shorter fields receive a higher score than matches against longer
fields.</p></li>
-
-                        <li><p><A HREF="api/org/apache/lucene/search/Similarity.html#coord(int,
int)">coord(q, d)</A> - Score factor based on how many terms the specified document
has in common with the query.  Typically, a document that contains more of the query's terms
will receive a higher score than another document with fewer query terms.</p></li>
-
-                        <li><p><A HREF="api/org/apache/lucene/search/Similarity.html#queryNorm(float)">queryNorm(sumOfSquaredWeights)</A>
- Factor used to make scores between queries comparable
-                            <span class="highlight-for-editing">GSI: might be interesting
to have a note on why this formula was chosen.  I have always understood (but not 100% sure)
-                                that it is not a good idea to compare scores across queries
or indexes, so any use of normalization may lead to false assumptions.  However, I also seem
-                            to remember some research on using sum of squares as being somewhat
suitable for score comparison.  Anyone have any thoughts here?</span></p></li>
-                    </ol>
-                    Note, the above definitions are summaries of the javadocs which can be
accessed by clicking the links in the formula and are merely provided
-                    for context and are not authoratitive.
+                This scoring formula is described in the
+                    <a href="api/org/apache/lucene/search/Similarity.html">Similarity</a>
class.  Please take the time to study this formula, as it contains much of the information
about how the
+                    basics of Lucene scoring work, especially the
+                    <a href="api/org/apache/lucene/search/TermScorer.html">TermScorer</a>.
                 </p>
                             </blockquote>
       </td></tr>
@@ -295,7 +284,7 @@
                     These implementations can be combined in a wide variety of ways to provide
complex querying
                     capabilities along with
                     information about where matches took place in the document collection.
The <a href="#Query Classes">Query</a>
-                    section below 
+                    section below
                     highlights some of the more important Query classes.  For information
on the other ones, see the
                     <a href="api/org/apache/lucene/search/package-summary.html">package
summary</a>.  For details on implementing
                     your own Query class, see <a href="#Changing your Scoring -- Expert
Level">Changing your Scoring --

Modified: lucene/java/trunk/src/java/org/apache/lucene/search/Similarity.java
URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/java/org/apache/lucene/search/Similarity.java?view=diff&rev=454767&r1=454766&r2=454767
==============================================================================
--- lucene/java/trunk/src/java/org/apache/lucene/search/Similarity.java (original)
+++ lucene/java/trunk/src/java/org/apache/lucene/search/Similarity.java Tue Oct 10 07:57:25
2006
@@ -16,67 +16,271 @@
  * limitations under the License.
  */
 
-import org.apache.lucene.index.IndexReader;
-import org.apache.lucene.index.IndexWriter;
-import org.apache.lucene.index.Term;
-import org.apache.lucene.util.SmallFloat;
-
 import java.io.IOException;
 import java.io.Serializable;
 import java.util.Collection;
 import java.util.Iterator;
 
+import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.index.IndexWriter;
+import org.apache.lucene.index.Term;
+import org.apache.lucene.util.SmallFloat;
+
 /** Expert: Scoring API.
  * <p>Subclasses implement search scoring.
  *
- * <p>The score of query <code>q</code> for document <code>d</code>
is defined
- * in terms of these methods as follows:
+ * <p>The score of query <code>q</code> for document <code>d</code>
correlates to the
+ * cosine-distance or dot-product between document and query vectors in a
+ * <a href="http://en.wikipedia.org/wiki/Vector_Space_Model">
+ * Vector Space Model (VSM) of Information Retrieval</a>.
+ * A document whose vector is closer to the query vector in that model is scored higher.
  *
- * <table cellpadding="0" cellspacing="0" border="0">
+ * The score is computed as follows:
+ *
+ * <P>
+ * <table cellpadding="1" cellspacing="0" border="1" align="center">
+ * <tr><td>
+ * <table cellpadding="1" cellspacing="0" border="0" align="center">
  *  <tr>
- *    <td valign="middle" align="right" rowspan="2">score(q,d) =<br></td>
- *    <td valign="middle" align="center">
- *    <big><big><big><big><big>&Sigma;</big></big></big></big></big></td>
- *    <td valign="middle"><small>
- *    ( {@link #tf(int) tf}(t in d) *
- *    {@link #idf(Term,Searcher) idf}(t)^2 *
- *    {@link Query#getBoost getBoost}(t in q) *
- *    {@link org.apache.lucene.document.Field#getBoost getBoost}(t.field in d) *
- *    {@link #lengthNorm(String,int) lengthNorm}(t.field in d) )
- *    </small></td>
- *    <td valign="middle" rowspan="2">&nbsp;*
- *    {@link #coord(int,int) coord}(q,d) *
- *    {@link #queryNorm(float) queryNorm}(sumOfSqaredWeights)
+ *    <td valign="middle" align="right" rowspan="1">
+ *      score(q,d) &nbsp; = &nbsp;
+ *      <A HREF="#formula_coord">coord(q,d)</A> &nbsp;&middot;&nbsp;
+ *      <A HREF="#formula_queryNorm">queryNorm(q)</A> &nbsp;&middot;&nbsp;
  *    </td>
- *  </tr>
- *  <tr>
- *   <td valign="top" align="right">
- *    <small>t in q</small>
+ *    <td valign="bottom" align="center" rowspan="1">
+ *      <big><big><big>&sum;</big></big></big>
  *    </td>
- *  </tr>
- * </table>
- * 
- * <p> where
- * 
- * <table cellpadding="0" cellspacing="0" border="0">
- *  <tr>
- *    <td valign="middle" align="right" rowspan="2">sumOfSqaredWeights =<br></td>
- *    <td valign="middle" align="center">
- *    <big><big><big><big><big>&Sigma;</big></big></big></big></big></td>
- *    <td valign="middle"><small>
- *    ( {@link #idf(Term,Searcher) idf}(t) *
- *    {@link Query#getBoost getBoost}(t in q) )^2
- *    </small></td>
- *  </tr>
- *  <tr>
- *   <td valign="top" align="right">
- *    <small>t in q</small>
+ *    <td valign="middle" align="right" rowspan="1">
+ *      <big><big>(</big></big>
+ *      <A HREF="#formula_tf">tf(t in d)</A> &nbsp;&middot;&nbsp;
+ *      <A HREF="#formula_idf">idf(t)</A><sup>2</sup> &nbsp;&middot;&nbsp;
+ *      <A HREF="#formula_termBoost">t.getBoost()</A>&nbsp;&middot;&nbsp;
+ *      <A HREF="#formula_norm">norm(t,d)</A>
+ *      <big><big>)</big></big>
  *    </td>
  *  </tr>
+ *  <tr valigh="top">
+ *   <td></td>
+ *   <td align="center"><small>t in q</small></td>
+ *   <td></td>
+ *  </tr>
+ * </table>
+ * </td></tr>
  * </table>
- * 
- * <p> Note that the above formula is motivated by the cosine-distance or dot-product
- * between document and query vector, which is implemented by {@link DefaultSimilarity}.
+ *
+ * <p> where
+ * <ol>
+ *    <li>
+ *      <A NAME="formula_tf"></A>
+ *      <b>tf(t in d)</b>
+ *      correlates to the term's <i>frequency</i>,
+ *      defined as the number of times term <i>t</i> appears in the currently
scored document <i>d</i>.
+ *      Documents that have more occurrences of a given term receive a higher score.
+ *      The default computation for <i>tf(t in d)</i> in
+ *      {@link org.apache.lucene.search.DefaultSimilarity#tf(float) DefaultSimilarity} is:
+ *
+ *      <br>&nbsp;<br>
+ *      <table cellpadding="2" cellspacing="2" border="0" align="center">
+ *        <tr>
+ *          <td valign="middle" align="right" rowspan="1">
+ *            {@link org.apache.lucene.search.DefaultSimilarity#tf(float) tf(t in d)} &nbsp;
= &nbsp;
+ *          </td>
+ *          <td valign="top" align="center" rowspan="1">
+ *               frequency<sup><big>&frac12;</big></sup>
+ *          </td>
+ *        </tr>
+ *      </table>
+ *      <br>&nbsp;<br>
+ *    </li>
+ *
+ *    <li>
+ *      <A NAME="formula_idf"></A>
+ *      <b>idf(t)</b> stands for Inverse Document Frequency. This value
+ *      correlates to the inverse of <i>docFreq</i>
+ *      (the number of documents in which the term <i>t</i> appears).
+ *      This means rarer terms give higher contribution to the total score.
+ *      The default computation for <i>idf(t)</i> in
+ *      {@link org.apache.lucene.search.DefaultSimilarity#idf(int, int) DefaultSimilarity}
is:
+ *
+ *      <br>&nbsp;<br>
+ *      <table cellpadding="2" cellspacing="2" border="0" align="center">
+ *        <tr>
+ *          <td valign="middle" align="right">
+ *            {@link org.apache.lucene.search.DefaultSimilarity#idf(int, int) idf(t)}&nbsp;
= &nbsp;
+ *          </td>
+ *          <td valign="middle" align="center">
+ *            1 + log <big>(</big>
+ *          </td>
+ *          <td valign="middle" align="center">
+ *            <table>
+ *               <tr><td align="center"><small>numDocs</small></td></tr>
+ *               <tr><td align="center">&ndash;&ndash;&ndash;&ndash;&ndash;&ndash;&ndash;&ndash;&ndash;</td></tr>
+ *               <tr><td align="center"><small>docFreq+1</small></td></tr>
+ *            </table>
+ *          </td>
+ *          <td valign="middle" align="center">
+ *            <big>)</big>
+ *          </td>
+ *        </tr>
+ *      </table>
+ *      <br>&nbsp;<br>
+ *    </li>
+ *
+ *    <li>
+ *      <A NAME="formula_coord"></A>
+ *      <b>coord(q,d)</b>
+ *      is a score factor based on how many of the query terms are found in the specified
document.
+ *      Typically, a document that contains more of the query's terms will receive a higher
score
+ *      than another document with fewer query terms.
+ *      This is a search time factor computed in
+ *      {@link #coord(int, int) coord(q,d)}
+ *      by the Similarity in effect at search time.
+ *      <br>&nbsp;<br>
+ *    </li>
+ *
+ *    <li><b>
+ *      <A NAME="formula_queryNorm"></A>
+ *      queryNorm(q)
+ *      </b>
+ *      is a normalizing factor used to make scores between queries comparable.
+ *      This factor does not affect document ranking (since all ranked documents are multiplied
by the same factor),
+ *      but rather just attempts to make scores from different queries (or even different
indexes) comparable.
+ *      This is a search time factor computed by the Similarity in effect at search time.
+ *
+ *      The default computation in
+ *      {@link org.apache.lucene.search.DefaultSimilarity#queryNorm(float) DefaultSimilarity}
+ *      is:
+ *      <br>&nbsp;<br>
+ *      <table cellpadding="1" cellspacing="0" border="0" align="center">
+ *        <tr>
+ *          <td valign="middle" align="right" rowspan="1">
+ *            queryNorm(q)  &nbsp; = &nbsp;
+ *            {@link org.apache.lucene.search.DefaultSimilarity#queryNorm(float) queryNorm(sumOfSquaredWeights)}
+ *            &nbsp; = &nbsp;
+ *          </td>
+ *          <td valign="middle" align="center" rowspan="1">
+ *            <table>
+ *               <tr><td align="center"><big>1</big></td></tr>
+ *               <tr><td align="center"><big>
+ *                  &ndash;&ndash;&ndash;&ndash;&ndash;&ndash;&ndash;&ndash;&ndash;&ndash;&ndash;&ndash;&ndash;&ndash;
+ *               </big></td></tr>
+ *               <tr><td align="center">sumOfSquaredWeights<sup><big>&frac12;</big></sup></td></tr>
+ *            </table>
+ *          </td>
+ *        </tr>
+ *      </table>
+ *      <br>&nbsp;<br>
+ *
+ *      The sum of squared weights (of the query terms) is
+ *      computed by the query {@link org.apache.lucene.search.Weight} object.
+ *      For example, a {@link org.apache.lucene.search.BooleanQuery boolean query}
+ *      computes this value as:
+ *
+ *      <br>&nbsp;<br>
+ *      <table cellpadding="1" cellspacing="0" border="0"n align="center">
+ *        <tr>
+ *          <td valign="middle" align="right" rowspan="1">
+ *            {@link org.apache.lucene.search.Weight#sumOfSquaredWeights() sumOfSquaredWeights}
&nbsp; = &nbsp;
+ *            {@link org.apache.lucene.search.Query#getBoost() q.getBoost()} <sup><big>2</big></sup>
+ *            &nbsp;&middot;&nbsp;
+ *          </td>
+ *          <td valign="bottom" align="center" rowspan="1">
+ *            <big><big><big>&sum;</big></big></big>
+ *          </td>
+ *          <td valign="middle" align="right" rowspan="1">
+ *            <big><big>(</big></big>
+ *            <A HREF="#formula_idf">idf(t)</A> &nbsp;&middot;&nbsp;
+ *            <A HREF="#formula_termBoost">t.getBoost()</A>
+ *            <big><big>) <sup>2</sup> </big></big>
+ *          </td>
+ *        </tr>
+ *        <tr valigh="top">
+ *          <td></td>
+ *          <td align="center"><small>t in q</small></td>
+ *          <td></td>
+ *        </tr>
+ *      </table>
+ *      <br>&nbsp;<br>
+ *
+ *    </li>
+ *
+ *    <li>
+ *      <A NAME="formula_termBoost"></A>
+ *      <b>t.getBoost()</b>
+ *      is a search time boost of term <i>t</i> in the query <i>q</i>
as
+ *      specified in the query text
+ *      (see <A HREF="../../../../../queryparsersyntax.html#Boosting a Term">query
syntax</A>),
+ *      or as set by application calls to
+ *      {@link org.apache.lucene.search.Query#setBoost(float) setBoost()}.
+ *      Notice that there is really no direct API for accessing a boost of one term in a
multi term query,
+ *      but rather multi terms are represented in a query as multi
+ *      {@link org.apache.lucene.search.TermQuery TermQuery} objects,
+ *      and so the boost of a term in the query is accessible by calling the sub-query
+ *      {@link org.apache.lucene.search.Query#getBoost() getBoost()}.
+ *      <br>&nbsp;<br>
+ *    </li>
+ *
+ *    <li>
+ *      <A NAME="formula_norm"></A>
+ *      <b>norm(t,d)</b> encapsulates a few (indexing time) boost and length
factors:
+ *
+ *      <ul>
+ *        <li><b>Document boost</b> - set by calling
+ *        {@link org.apache.lucene.document.Document#setBoost(float) doc.setBoost()}
+ *        before adding the document to the index.
+ *        </li>
+ *        <li><b>Field boost</b> - set by calling
+ *        {@link org.apache.lucene.document.Fieldable#setBoost(float) field.setBoost()}
+ *        before adding the field to a document.
+ *        </li>
+ *        <li>{@link #lengthNorm(String, int) <b>lengthNorm</b>(field)}
- computed
+ *        when the document is added to the index in accordance with the number of tokens
+ *        of this field in the document, so that shorter fields contribute more to the score.
+ *        LengthNorm is computed by the Similarity class in effect at indexing.
+ *        </li>
+ *      </ul>
+ *
+ *      <p>
+ *      When a document is added to the index, all the above factors are multiplied.
+ *      If the document has multiple fields with the same name, all their boosts are multiplied
together:
+ *
+ *      <br>&nbsp;<br>
+ *      <table cellpadding="1" cellspacing="0" border="0"n align="center">
+ *        <tr>
+ *          <td valign="middle" align="right" rowspan="1">
+ *            norm(t,d) &nbsp; = &nbsp;
+ *            {@link org.apache.lucene.document.Document#getBoost() doc.getBoost()}
+ *            &nbsp;&middot;&nbsp;
+ *            {@link #lengthNorm(String, int) lengthNorm(field)}
+ *            &nbsp;&middot;&nbsp;
+ *          </td>
+ *          <td valign="bottom" align="center" rowspan="1">
+ *            <big><big><big>&prod;</big></big></big>
+ *          </td>
+ *          <td valign="middle" align="right" rowspan="1">
+ *            {@link org.apache.lucene.document.Fieldable#getBoost() f.getBoost}()
+ *          </td>
+ *        </tr>
+ *        <tr valigh="top">
+ *          <td></td>
+ *          <td align="center"><small>field <i><b>f</b></i>
in <i>d</i> named as <i><b>t</b></i></small></td>
+ *          <td></td>
+ *        </tr>
+ *      </table>
+ *      <br>&nbsp;<br>
+ *      However the resulted <i>norm</i> value is {@link #encodeNorm(float) encoded}
as a single byte
+ *      before being stored.
+ *      At search time, the norm byte value is read from the index
+ *      {@link org.apache.lucene.store.Directory directory} and
+ *      {@link #decodeNorm(byte) decoded} back to a float <i>norm</i> value.
+ *      This encoding/decoding, while reducing index size, comes with the price of
+ *      precision loss - it is not guaranteed that decode(encode(x)) = x.
+ *      For instance, decode(encode(0.89)) = 0.75.
+ *      Also notice that search time is too late to modify this <i>norm</i> part
of scoring, e.g. by
+ *      using a different {@link Similarity} for search.
+ *      <br>&nbsp;<br>
+ *    </li>
+ * </ol>
  *
  * @see #setDefault(Similarity)
  * @see IndexWriter#setSimilarity(Similarity)

Modified: lucene/java/trunk/xdocs/scoring.xml
URL: http://svn.apache.org/viewvc/lucene/java/trunk/xdocs/scoring.xml?view=diff&rev=454767&r1=454766&r2=454767
==============================================================================
--- lucene/java/trunk/xdocs/scoring.xml (original)
+++ lucene/java/trunk/xdocs/scoring.xml Tue Oct 10 07:57:25 2006
@@ -61,80 +61,60 @@
                     on the Fields).
                 </p>
             </subsection>
-            <subsection name="Understanding the Scoring Formula">
-                <p>
-                    Lucene's scoring formula computes the score of one document <i>d</i>
for a given query <i>q</i> across each
-                    term <i>t</i> that occurs in q.  The score attempts to measure
relevance, so the higher the score, the more
-		    relevant document <i>d</i> is to the query <i>q</i>.  This
is taken from
-		    <a href="api/org/apache/lucene/search/Similarity.html">Similarity</a>:
-
-                    <div class="formula">
-                        <!-- Anyone know how to specify sigma in Anakia?  It always seems
to strip out my numeric character references-->
-                        score(q,d) =
-			<span class="big" id="summation">
-                            sum </span><span class="summation-range">t in q</span><span>(
-                        <A HREF="api/org/apache/lucene/search/Similarity.html#tf(int)">tf</A>
-                        (t in d) *
-                        <A HREF="api/org/apache/lucene/search/Similarity.html#idf(org.apache.lucene.index.Term,
org.apache.lucene.search.Searcher)">idf</A>
-                        (t)^2 *
-                        <A HREF="api/org/apache/lucene/search/Query.html#getBoost()">
-                        getBoost
-                        </A>
-                        (t in q) *
-                        getBoost
-                        (t.field in d) *
-                        <A HREF="api/org/apache/lucene/search/Similarity.html#lengthNorm(java.lang.String,
int)">
-                            lengthNorm
-                        </A>
-                        (t.field in d) )</span> <span> *
-                        <A HREF="api/org/apache/lucene/search/Similarity.html#coord(int,
int)">
-                            coord
-                        </A>
-                        (q,d) *
-                        <A HREF="api/org/apache/lucene/search/Similarity.html#queryNorm(float)">
-                            queryNorm
-                        </A>(sumOfSquaredWeights)</span>
-                    </div>
+            <subsection name="Score Boosting">
+                <p>Lucene allows influencing search results by "boosting" in more than
one level:
+                  <ul>
+                    <li><b>Document level boosting</b>
+                    - while indexing - by calling
+                    <a href="api/org/apache/lucene/document/Document.html#setBoost(float)">document.setBoost()</a>
+                    before a document is added to the index.
+                    </li>
+                    <li><b>Document's Field level boosting</b>
+                    - while indexing - by calling
+                    <a href="api/org/apache/lucene/document/Fieldable.html#setBoost(float)">field.setBoost()</a>
+                    before adding a field to the document (and before adding the document
to the index).
+                    </li>
+                    <li><b>Query level boosting</b>
+                     - during search, by setting a boost on a query clause, calling
+                     <a href="api/org/apache/lucene/search/Query.html#setBoost(float)">Query.setBoost()</a>.
+                    </li>
+                  </ul>
                 </p>
-                <p>
-                    where
-                    <!-- Anyone know how to specify sigma in Anakia?  It always seems
to strip out my numeric character references-->
-                    <div id="#sumOfSquares">
-                        sumOfSquaredWeights =
-                        <span class="big">sum</span><span class="summation-range">t
in q</span><span>(
-                        <A HREF="api/org/apache/lucene/search/Similarity.html#idf(org.apache.lucene.index.Term,
org.apache.lucene.search.Searcher)">
-                            idf
-                        </A>
-                        (t) *
-                        <A HREF="api/org/apache/lucene/search/Query.html#getBoost()">
-                            getBoost
-                        </A>
-                        (t in q) )^2</span>
-                    </div>
+                <p>Indexing time boosts are preprocessed for storage efficiency and
written to
+                  the directory (when writing the document) in a single byte (!) as follows:
+                  For each field of a document, all boosts of that field
+                  (i.e. all boosts under the same field name in that doc) are multiplied.
+                  The result is multiplied by the boost of the document,
+                  and also multiplied by a "field length norm" value
+                  that represents the length of that field in that doc
+                  (so shorter fields are automatically boosted up).
+                  The result is decoded as a single byte
+                  (with some precision loss of course) and stored in the directory.
+                  The similarity object in effect at indexing computes the length-norm of
the field.
                 </p>
-                <p>
-		This scoring formula is mostly implemented in the
-                    <a href="api/org/apache/lucene/search/TermScorer.html">TermScorer</a>
class, where it makes calls to the
-                    <a href="api/org/apache/lucene/search/Similarity.html">Similarity</a>
class to retrieve values for the following.  Note that the descriptions apply to <a href="api/org/apache/lucene/search/DefaultSimilarity.html">DefaultSimilarity</a>
implementation:
-                    <ol>
-
-                        <li><A HREF="api/org/apache/lucene/search/Similarity.html#tf(int)">tf(t
in d)</A> - Term Frequency - The number of times the term <i>t</i> appears
in the current document <i>d</i> being scored.  Documents that have more occurrences
of a given term receive a higher score.</li>
-
-                        <li><p><A HREF="api/org/apache/lucene/search/Similarity.html#idf(org.apache.lucene.index.Term,
org.apache.lucene.search.Searcher)">idf(t)</A> - Inverse Document Frequency - One
divided by the number of documents in which the term <i>t</i> appears.  This means
rarer terms give higher contribution to the total score.</p></li>
-
-                        <li><p><A HREF="api/org/apache/lucene/search/Query.html#getBoost()">getBoost(t
in q)</A> - The boost, specified in the query by the user, that should be applied to
this term.  A boost over 1.0 will increase the importance of this term; a boost under 1.0
will decrease its importance.  A boost of 1.0 (the default boost) has no effect.</p></li>
-
-                        <li><p><A HREF="api/org/apache/lucene/search/Similarity.html#lengthNorm(java.lang.String,
int)">lengthNorm(t.field in q)</A> - The factor to apply to account for differing
lengths in the fields that are being searched.  Typically longer fields return a smaller value.
 This means matches against shorter fields receive a higher score than matches against longer
fields.</p></li>
-
-                        <li><p><A HREF="api/org/apache/lucene/search/Similarity.html#coord(int,
int)">coord(q, d)</A> - Score factor based on how many terms the specified document
has in common with the query.  Typically, a document that contains more of the query's terms
will receive a higher score than another document with fewer query terms.</p></li>
+                <p>This composition of 1-byte representation of norms
+                (that is, indexing time multiplication of field boosts &amp; doc boost
&amp; field-length-norm)
+                is nicely described in
+                <a href="api/org/apache/lucene/document/Fieldable.html#setBoost(float)">Fieldable.setBoost()</a>.
+                </p>
+                <p>Encoding and decoding of the resulted float norm in a single byte
are done by the
+                static methods of the class Similarity:
+                <a href="api/org/apache/lucene/search/Similarity.html#encodeNorm(float)">encodeNorm()</a>
and
+                <a href="api/org/apache/lucene/search/Similarity.html#decodeNorm(byte)">decodeNorm()</a>.
+                Due to loss of precision, it is not guaranteed that decode(encode(x)) = x,
+                e.g. decode(encode(0.89)) = 0.75.
+                At scoring (search) time, this norm is brought into the score of document
+                as <b>indexBoost</b>, as shown by the formula in
+                <a href="api/org/apache/lucene/search/Similarity.html">Similarity</a>.
+                </p>
+            </subsection>
+            <subsection name="Understanding the Scoring Formula">
 
-                        <li><p><A HREF="api/org/apache/lucene/search/Similarity.html#queryNorm(float)">queryNorm(sumOfSquaredWeights)</A>
- Factor used to make scores between queries comparable
-                            <span class="highlight-for-editing">GSI: might be interesting
to have a note on why this formula was chosen.  I have always understood (but not 100% sure)
-                                that it is not a good idea to compare scores across queries
or indexes, so any use of normalization may lead to false assumptions.  However, I also seem
-                            to remember some research on using sum of squares as being somewhat
suitable for score comparison.  Anyone have any thoughts here?</span></p></li>
-                    </ol>
-                    Note, the above definitions are summaries of the javadocs which can be
accessed by clicking the links in the formula and are merely provided
-                    for context and are not authoratitive.
+                <p>
+                This scoring formula is described in the
+                    <a href="api/org/apache/lucene/search/Similarity.html">Similarity</a>
class.  Please take the time to study this formula, as it contains much of the information
about how the
+                    basics of Lucene scoring work, especially the
+                    <a href="api/org/apache/lucene/search/TermScorer.html">TermScorer</a>.
                 </p>
             </subsection>
             <subsection name="The Big Picture">
@@ -150,7 +130,7 @@
                     These implementations can be combined in a wide variety of ways to provide
complex querying
                     capabilities along with
                     information about where matches took place in the document collection.
The <a href="#Query Classes">Query</a>
-                    section below 
+                    section below
                     highlights some of the more important Query classes.  For information
on the other ones, see the
                     <a href="api/org/apache/lucene/search/package-summary.html">package
summary</a>.  For details on implementing
                     your own Query class, see <a href="#Changing your Scoring -- Expert
Level">Changing your Scoring --



Mime
View raw message