lucene-java-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From gsing...@apache.org
Subject svn commit: r438127 - /lucene/java/trunk/docs/scoring.html
Date Tue, 29 Aug 2006 17:43:19 GMT
Author: gsingers
Date: Tue Aug 29 10:43:18 2006
New Revision: 438127

URL: http://svn.apache.org/viewvc?rev=438127&view=rev
Log:
scoring documentation updates from Michael McCandless

Modified:
    lucene/java/trunk/docs/scoring.html

Modified: lucene/java/trunk/docs/scoring.html
URL: http://svn.apache.org/viewvc/lucene/java/trunk/docs/scoring.html?rev=438127&r1=438126&r2=438127&view=diff
==============================================================================
--- lucene/java/trunk/docs/scoring.html (original)
+++ lucene/java/trunk/docs/scoring.html Tue Aug 29 10:43:18 2006
@@ -181,7 +181,7 @@
                     and the other in one Field will return different scores for the same
query due to length normalization
                     (assumming the
                     <a href="api/org/apache/lucene/search/DefaultSimilarity.html">DefaultSimilarity</a>
-                    on the Fields.)
+                    on the Fields).
                 </p>
                             </blockquote>
       </td></tr>
@@ -288,13 +288,13 @@
                     <a href="api/org/apache/lucene/search/Query.html">Query</a>
classes, as created by each application in
                     response to a user's information need.
                 </p>
-                                                <p>In this regard, Lucene offers a
wide variety of Query implementations, most of which are in the
-                    org.apache.lucene.search package.
+                                                <p>In this regard, Lucene offers a
wide variety of <a href="api/org/apache/lucene/search/Query.html">Query</a> implementations,
most of which are in the
+                    <a href="api/org/apache/lucene/search/package-summary.html">org.apache.lucene.search</a>
package.
                     These implementations can be combined in a wide variety of ways to provide
complex querying
                     capabilities along with
                     information about where matches took place in the document collection.
The <a href="#Query Classes">Query</a>
-                    section below will
-                    highlight some of the more important Query classes.  For information
on the other ones, see the
+                    section below 
+                    highlights some of the more important Query classes.  For information
on the other ones, see the
                     <a href="api/org/apache/lucene/search/package-summary.html">package
summary</a>.  For details on implementing
                     your own Query class, see <a href="#Changing your Scoring -- Expert
Level">Changing your Scoring --
                     Expert Level</a> below.
@@ -302,7 +302,7 @@
                                                 <p>Once a Query has been created and
submitted to the
                     <a href="api/org/apache/lucene/search/IndexSearcher.html">IndexSearcher</a>,
the scoring process
                 begins.  (See the <a href="#Appendix">Appendix</a> Algorithm
section for more notes on the process.)  After some infrastructure setup,
-                control finally passes to the Weight implementation and it's
+                control finally passes to the <a href="api/org/apache/lucene/search/Weight.html">Weight</a>
implementation and its
                     <a href="api/org/apache/lucene/search/Scorer.html">Scorer</a>
instance.  In the case of any type of
                     <a href="api/org/apache/lucene/search/BooleanQuery.html">BooleanQuery</a>,
scoring is handled by the
                     <a href="http://svn.apache.org/viewvc/lucene/java/trunk/src/java/org/apache/lucene/search/BooleanQuery.java?view=log">BooleanWeight2</a>
(link goes to ViewVC BooleanQuery java code which contains the BooleanWeight2 inner class),
@@ -340,68 +340,77 @@
                     <a href="api/org/apache/lucene/search/TermQuery.html">TermQuery</a>
                 </h4>
                                                 <p>Of the various implementations of
-                    Query, the
+                    <a href="api/org/apache/lucene/search/Query.html">Query</a>,
the
                     <a href="api/org/apache/lucene/search/TermQuery.html">TermQuery</a>
-                    is the easiest to understand and the most often used in most applications.
A TermQuery is a Query
-                    that matches all the documents that contain the specified
-                    <a href="api/org/apache/lucene/index/Term.html">Term</a>
-                    . A Term is a word that occurs in a specific
-                    <a href="api/org/apache/lucene/document/Field.html">Field</a>
-                    . Thus, a TermQuery identifies and scores all
-                    <a href="api/org/apache/lucene/document/Document.html">Document</a>
-                    s that have a Field with the specified string in it.
-                    Constructing a TermQuery is as simple as:
-                    <code>TermQuery tq = new TermQuery(new Term("fieldName", "term");</code>
-                    In this example, the Query would identify all Documents that have the
Field named "fieldName" that
-                    contain the word "term".
+                    is the easiest to understand and the most often used in applications.
A <a href="api/org/apache/lucene/search/TermQuery.html">TermQuery</a> matches
all the documents that contain the specified
+                    <a href="api/org/apache/lucene/index/Term.html">Term</a>,
+                    which is a word that occurs in a certain
+                    <a href="api/org/apache/lucene/document/Field.html">Field</a>.
+                    Thus, a <a href="api/org/apache/lucene/search/TermQuery.html">TermQuery</a>
identifies and scores all
+                    <a href="api/org/apache/lucene/document/Document.html">Document</a>s
that have a <a href="api/org/apache/lucene/document/Field.html">Field</a> with
the specified string in it.
+                    Constructing a <a href="api/org/apache/lucene/search/TermQuery.html">TermQuery</a>
+                    is as simple as:
+		    <pre>
+      TermQuery tq = new TermQuery(new Term("fieldName", "term");
+		    </pre>In this example, the <a href="api/org/apache/lucene/search/Query.html">Query</a>
identifies all <a href="api/org/apache/lucene/document/Document.html">Document</a>s
that have the <a href="api/org/apache/lucene/document/Field.html">Field</a> named
<tt>"fieldName"</tt> and
+                    contain the word <tt>"term"</tt>.
                 </p>
                                                 <h4>
                     <a href="api/org/apache/lucene/search/BooleanQuery.html">BooleanQuery</a>
                 </h4>
-                                                <p>Things start to get interesting
when one starts to combine TermQuerys, which is handled by the
-                    <a href="api/org/apache/lucene/search/BooleanQuery.html">BooleanQuery</a>
-                    class. The BooleanQuery is a collection
-                    of other
-                    <a href="api/org/apache/lucene/search/Query.html">Query</a>
-                    classes along with semantics about how to combine the different subqueries.
-                    It currently supports three different operators for specifying the logic
of the query (see
-                    <a href="api/org/apache/lucene/search/BooleanClause.html">BooleanClause</a>
-                    )
+                                                <p>Things start to get interesting
when one combines multiple
+		   <a href="api/org/apache/lucene/search/TermQuery.html">TermQuery</a> instances
into a <a href="api/org/apache/lucene/search/BooleanQuery.html">BooleanQuery</a>.
+                    A <a href="api/org/apache/lucene/search/BooleanQuery.html">BooleanQuery</a>
contains multiple
+		    <a href="api/org/apache/lucene/search/BooleanClause.html">BooleanClause</a>s,
+                    where each clause contains a sub-query (<a href="api/org/apache/lucene/search/Query.html">Query</a>
+		   instance) and an operator (from <a href="api/org/apache/lucene/search/BooleanClause.Occur.html">BooleanClause.Occur</a>)
+		   describing how that sub-query is combined with the other clauses:
                     <ol>
-                        <li>SHOULD -- Use this operator when a clause can occur in
the result set, but is not required.
-                            If a query is made up of all SHOULD clauses, then a non-empty
result
-                            set will have matched at least one of the clauses in the query.</li>
-                        <li>MUST -- Use this operator when a clause is required to
occur in the result set.</li>
-                        <li>MUST NOT -- Use this operator when a clause must not occur
in the result set.</li>
+
+                        <li><p>SHOULD -- Use this operator when a clause can
occur in the result set, but is not required.
+                            If a query is made up of all SHOULD clauses, then every document
in the result
+                            set matches at least one of these clauses.</p></li>
+
+                        <li><p>MUST -- Use this operator when a clause is required
to occur in the result set.  Every
+			       document in the result set will match
+			       all such clauses.</p></li>
+
+                        <li><p>MUST NOT -- Use this operator when a
+			       clause must not occur in the result set.  No
+		               document in the result set will match
+                               any such clauses.</p></li>
                     </ol>
                     Boolean queries are constructed by adding two or more
                     <a href="api/org/apache/lucene/search/BooleanClause.html">BooleanClause</a>
-                    instances to the BooleanQuery instance. In some cases,
-                    too many clauses may be added to the BooleanQuery, which will cause a
TooManyClauses exception to be
-                    thrown. This
-                    most often occurs when using a Query that is rewritten into many TermQuery
instances, such as the
-                    <a href="api/org/apache/lucene/search/WildCardQuery.html">WildCardQuery</a>
-                    . The default
-                    setting for too many clauses is currently set to 1024, but it can be
overridden via the
-                    <a href="api/org/apache/lucene/search/BooleanQuery.html#setMaxClauseCount(int)">BooleanQuery#setMaxClauseCount(int)</a>
static method on BooleanQuery.
+                    instances. If too many clauses are added, a <a href="api/org/apache/lucene/search/BooleanQuery.TooManyClauses.html">TooManyClauses</a>
+	           exception will be thrown during searching. This most often occurs 
+		   when a <a href="api/org/apache/lucene/search/Query.html">Query</a>
+		   is rewritten into a <a href="api/org/apache/lucene/search/BooleanQuery.html">BooleanQuery</a>
with many
+		   <a href="api/org/apache/lucene/search/TermQuery.html">TermQuery</a> clauses,
+		   for example by <a href="api/org/apache/lucene/search/WildcardQuery.html">WildcardQuery</a>.
+                   The default setting for the maximum number
+		   of clauses 1024, but this can be changed via the
+		   static method <a href="api/org/apache/lucene/search/BooleanQuery.html#setMaxClauseCount(int)">setMaxClauseCount</a>
+		   in <a href="api/org/apache/lucene/search/BooleanQuery.html">BooleanQuery</a>.
                 </p>
                                                 <h4>Phrases</h4>
-                                                <p>Another common task in search is
to identify phrases, which can be handled in two different ways.
+                                                <p>Another common search is to find
documents containing certain phrases.  This
+                   is handled in two different ways.
                     <ol>
                         <li>
-                            <a href="api/org/apache/lucene/search/PhraseQuery.html">PhraseQuery</a>
+                            <p><a href="api/org/apache/lucene/search/PhraseQuery.html">PhraseQuery</a>
                             -- Matches a sequence of
-                            <a href="api/org/apache/lucene/index/Term.html">Terms</a>
-                            . The PhraseQuery can specify a slop factor which determines
-                            how many positions may occur between any two terms and still
be considered a match.
+                            <a href="api/org/apache/lucene/index/Term.html">Terms</a>.
+                            <a href="api/org/apache/lucene/search/PhraseQuery.html">PhraseQuery</a>
uses a slop factor to determine
+                            how many positions may occur between any two terms in the phrase
and still be considered a match.</p>
                         </li>
                         <li>
-                            <a href="api/org/apache/lucene/search/spans/SpanNearQuery.html">SpanNearQuery</a>
+                            <p><a href="api/org/apache/lucene/search/spans/SpanNearQuery.html">SpanNearQuery</a>
                             -- Matches a sequence of other
                             <a href="api/org/apache/lucene/search/spans/SpanQuery.html">SpanQuery</a>
-                            instances. The SpanNearQuery allows for much more
-                            complicated phrasal queries to be built since it is constructed
out of other SpanQuery
-                            objects, not just Terms.
+                            instances. <a href="api/org/apache/lucene/search/spans/SpanNearQuery.html">SpanNearQuery</a>
allows for much more
+                            complicated phrase queries since it is constructed from other
to <a href="api/org/apache/lucene/search/spans/SpanQuery.html">SpanQuery</a>
+                            instances, instead of only <a href="api/org/apache/lucene/search/TermQuery.html">TermQuery</a>
instances.</p>
                         </li>
                     </ol>
                 </p>
@@ -414,41 +423,39 @@
                     exclusive range of a lower
                     <a href="api/org/apache/lucene/index/Term.html">Term</a>
                     and an upper
-                    <a href="api/org/apache/lucene/index/Term.html">Term</a>
-                    . For instance, one could find all documents
-                    that have terms beginning with the letters a through c. This type of
Query is most often used to
+                    <a href="api/org/apache/lucene/index/Term.html">Term</a>.
+                    For example, one could find all documents
+                    that have terms beginning with the letters <tt>a</tt> through
<tt>c</tt>. This type of <a href="api/org/apache/lucene/search/Query.html">Query</a>
is frequently used to
                     find
                     documents that occur in a specific date range.
                 </p>
                                                 <h4>
-                    <a href="api/org/apache/lucene/search/PrefixQuery.html">PrefixQuery</a>
-                    ,
+                    <a href="api/org/apache/lucene/search/PrefixQuery.html">PrefixQuery</a>,
                     <a href="api/org/apache/lucene/search/WildcardQuery.html">WildcardQuery</a>
                 </h4>
                                                 <p>While the
                     <a href="api/org/apache/lucene/search/PrefixQuery.html">PrefixQuery</a>
                     has a different implementation, it is essentially a special case of the
-                    <a href="api/org/apache/lucene/search/WildcardQuery.html">WildcardQuery</a>
-                    . The PrefixQuery allows an application
-                    to identify all documents with terms that begin with a certain string.
The WildcardQuery generalize
-                    this by allowing
-                    for the use of * and ? wildcards. Note that the WildcardQuery can be
quite slow. Also note that
-                    WildcardQuerys should
-                    not start with * and ?, as these are extremely slow. For tricks on how
to search using a wildcard at
+                    <a href="api/org/apache/lucene/search/WildcardQuery.html">WildcardQuery</a>.
+                    The <a href="api/org/apache/lucene/search/PrefixQuery.html">PrefixQuery</a>
allows an application
+                    to identify all documents with terms that begin with a certain string.
The <a href="api/org/apache/lucene/search/WildcardQuery.html">WildcardQuery</a>
generalizes this by allowing
+                    for the use of <tt>*</tt> (matches 0 or more characters)
and <tt>?</tt> (matches exactly one character) wildcards. Note that the <a
href="api/org/apache/lucene/search/WildcardQuery.html">WildcardQuery</a> can be quite
slow. Also note that
+                    <a href="api/org/apache/lucene/search/WildcardQuery.html">WildcardQuery</a>
should
+                    not start with <tt>*</tt> and <tt>?</tt>, as
these are extremely slow. For tricks on how to search using a wildcard at
                     the beginning of a term, see
-                    <a href="http://www.gossamer-threads.com/lists/lucene/java-user/13373?search_string=WildcardQuery%20start;#13373">
+                    <a href="http://www.gossamer-threads.com/lists/lucene/java-user/13373#13373">
                         Starts With x and Ends With x Queries</a>
-                    from the Lucene archives.
+                    from the Lucene users's mailing list.
                 </p>
                                                 <h4>
                     <a href="api/org/apache/lucene/search/FuzzyQuery.html">FuzzyQuery</a>
                 </h4>
                                                 <p>A
                     <a href="api/org/apache/lucene/search/FuzzyQuery.html">FuzzyQuery</a>
-                    matches documents that contain similar terms to the specified term. Similarity
is
-                    determined using the
-                    <a href="http://en.wikipedia.org/wiki/Levenshtein">Levenshtein
(edit distance) algorithm</a>
-                    . This type of query can be useful when accounting for spelling variations
in the collection.
+                    matches documents that contain terms similar to the specified term. Similarity
is
+                    determined using
+                    <a href="http://en.wikipedia.org/wiki/Levenshtein">Levenshtein
(edit) distance</a>.
+                    This type of query can be useful when accounting for spelling variations
in the collection.
                 </p>
                             </blockquote>
       </td></tr>
@@ -462,33 +469,32 @@
       </td></tr>
       <tr><td>
         <blockquote>
-                                    <p>Chances are, the
-                    <a href="api/org/apache/lucene/search/DefaultSimilarity.html">DefaultSimilarity</a>
is sufficient for all your searching needs.
-                    However, in some applications it may be necessary to alter your Similarity.
 For instance, some applications do not need to
-                    distinguish between shorter documents and longer documents (for example,
-                    see <a href="http://www.gossamer-threads.com/lists/lucene/java-user/38967?search_string=Similarity;#38967">a
"fair" similarity</a>)
-                    To change the Similarity, one must do so for both indexing and searching
and the changes must take place before
-                    any of these actions are undertaken (although in theory there is nothing
stopping you from changing mid-stream, it just isn't well-defined what is going to happen).
-                    To make this change, implement your Similarity (you probably want to
override
-                    <a href="api/org/apache/lucene/search/DefaultSimilarity.html">DefaultSimilarity</a>)
and then set the new
-                    class on
-                    <a href="api/org/apache/lucene/index/IndexWriter.html#setSimilarity(org.apache.lucene.search.Similarity)">IndexWriter.setSimilarity(org.apache.lucene.search.Similarity)</a>
for indexing and on
-                    <a href="api/org/apache/lucene/search/Searcher.html#setSimilarity(org.apache.lucene.search.Similarity)">Searcher.setSimilarity(org.apache.lucene.search.Similarity)</a>.
+                                    <p>Chances are <a href="api/org/apache/lucene/search/DefaultSimilarity.html">DefaultSimilarity</a>
is sufficient for all your searching needs.
+                    However, in some applications it may be necessary to customize your <a
href="api/org/apache/lucene/search/Similarity.html">Similarity</a> implementation.
 For instance, some applications do not need to
+                    distinguish between shorter and longer documents (see <a href="http://www.gossamer-threads.com/lists/lucene/java-user/38967#38967">a
"fair" similarity</a>).</p>
+                                                <p>To change <a href="api/org/apache/lucene/search/Similarity.html">Similarity</a>,
one must do so for both indexing and searching, and the changes must happen before
+                    either of these actions take place.  Although in theory there is nothing
stopping you from changing mid-stream, it just isn't well-defined what is going to happen.
+                    </p>
+                                                <p>To make this change, implement your
own <a href="api/org/apache/lucene/search/Similarity.html">Similarity</a> (likely
you'll want to simply subclass
+                    <a href="api/org/apache/lucene/search/DefaultSimilarity.html">DefaultSimilarity</a>)
and then use the new
+                    class by calling
+                    <a href="api/org/apache/lucene/index/IndexWriter.html#setSimilarity(org.apache.lucene.search.Similarity)">IndexWriter.setSimilarity</a>
before indexing and
+                    <a href="api/org/apache/lucene/search/Searcher.html#setSimilarity(org.apache.lucene.search.Similarity)">Searcher.setSimilarity</a>
before searching.
                 </p>
                                                 <p>
-                    If you are interested in use cases for changing your similarity, see
the mailing list at <a href="http://www.nabble.com/Overriding-Similarity-tf2128934.html">Overriding
Similarity</a>.
+                    If you are interested in use cases for changing your similarity, see
the Lucene users's mailing list at <a href="http://www.nabble.com/Overriding-Similarity-tf2128934.html">Overriding
Similarity</a>.
                     In summary, here are a few use cases:
                     <ol>
-                        <li>SweetSpotSimilarity -- SweetSpotSimilarity gives small
increases as the frequency increases a small amount
-                        and then greater increases when you hit the "sweet spot", i.e. where
you think the frequency of terms is more significant.</li>
-                        <li>Overriding tf -- In some applications, it doesn't matter
what the score of a document is as long as a matching term occurs.  In these
-                        cases people have overridden Similarity to return 1 from the tf()
method.</li>
-                        <li>Changing Length Normalization -- By overriding lengthNorm,
it is possible to discount how the length of a field contributes
-                        to a score.  In the DefaultSimilarity, lengthNorm = 1/ (numTerms
in field)^0.5, but if one changes this to be
+                        <li><p><a href="api/org/apache/lucene/misc/SweetSpotSimilarity.html">SweetSpotSimilarity</a>
-- <a href="api/org/apache/lucene/misc/SweetSpotSimilarity.html">SweetSpotSimilarity</a>
gives small increases as the frequency increases a small amount
+                        and then greater increases when you hit the "sweet spot", i.e. where
you think the frequency of terms is more significant.</p></li>
+                        <li><p>Overriding tf -- In some applications, it doesn't
matter what the score of a document is as long as a matching term occurs.  In these
+                        cases people have overridden Similarity to return 1 from the tf()
method.</p></li>
+                        <li><p>Changing Length Normalization -- By overriding
<a href="api/org/apache/lucene/search/Similarity.html#lengthNorm(java.lang.String,%20int)">lengthNorm</a>,
it is possible to discount how the length of a field contributes
+                        to a score.  In <a href="api/org/apache/lucene/search/DefaultSimilarity.html">DefaultSimilarity</a>,
lengthNorm = 1 / (numTerms in field)^0.5, but if one changes this to be
                         1 / (numTerms in field), all fields will be treated
-                            <a href="http://www.gossamer-threads.com/lists/lucene/java-user/38967?search_string=Similarity;#38967">"fairly"</a>.</li>
+                            <a href="http://www.gossamer-threads.com/lists/lucene/java-user/38967#38967">"fairly"</a>.</p></li>
                     </ol>
-                    In general, Chris Hostetter sums it up best in saying (from <a href="http://www.gossamer-threads.com/lists/lucene/java-user/39125">the
mailing list</a>):
+                    In general, Chris Hostetter sums it up best in saying (from <a href="http://www.gossamer-threads.com/lists/lucene/java-user/39125#39125">the
Lucene users's mailing list</a>):
                     <blockquote>[One would override the Similarity in] ... any situation
where you know more about your data then just that
                         it's "text" is a situation where it *might* make sense to to override
your
                         Similarity method.</blockquote>



Mime
View raw message