Of the various implementations of
- Query, the
- TermQuery
- is the easiest to understand and the most often used in applications. A TermQuery matches all the documents that contain the specified
- Term,
- which is a word that occurs in a certain
- Field.
- Thus, a TermQuery identifies and scores all
- Documents that have a Field with the specified string in it.
- Constructing a TermQuery
- is as simple as:
- Things start to get interesting when one combines multiple
- TermQuery instances into a BooleanQuery.
- A BooleanQuery contains multiple
- BooleanClauses,
- where each clause contains a sub-query (Query
- instance) and an operator (from BooleanClause.Occur)
- describing how that sub-query is combined with the other clauses:
- SHOULD -- Use this operator when a clause can occur in the result set, but is not required.
- If a query is made up of all SHOULD clauses, then every document in the result
- set matches at least one of these clauses. MUST -- Use this operator when a clause is required to occur in the result set. Every
- document in the result set will match
- all such clauses. MUST NOT -- Use this operator when a
- clause must not occur in the result set. No
- document in the result set will match
- any such clauses. Another common search is to find documents containing certain phrases. This
- is handled in two different ways.
- PhraseQuery
- -- Matches a sequence of
- Terms.
- PhraseQuery uses a slop factor to determine
- how many positions may occur between any two terms in the phrase and still be considered a match. SpanNearQuery
- -- Matches a sequence of other
- SpanQuery
- instances. SpanNearQuery allows for much more
- complicated phrase queries since it is constructed from other SpanQuery
- instances, instead of only TermQuery instances. The
- RangeQuery
- matches all documents that occur in the
- exclusive range of a lower
- Term
- and an upper
- Term.
- For example, one could find all documents
- that have terms beginning with the letters a through c. This type of Query is frequently used to
- find
- documents that occur in a specific date range.
- While the
- PrefixQuery
- has a different implementation, it is essentially a special case of the
- WildcardQuery.
- The PrefixQuery allows an application
- to identify all documents with terms that begin with a certain string. The WildcardQuery generalizes this by allowing
- for the use of * (matches 0 or more characters) and ? (matches exactly one character) wildcards. Note that the WildcardQuery can be quite slow. Also note that
- WildcardQuery should
- not start with * or ?, as such queries are extremely slow. For tricks on how to search using a wildcard at
- the beginning of a term, see
-
- Starts With x and Ends With x Queries
- from the Lucene users' mailing list.
- A
- FuzzyQuery
- matches documents that contain terms similar to the specified term. Similarity is
- determined using
- Levenshtein (edit) distance.
- This type of query can be useful when accounting for spelling variations in the collection.
+ For information on the Query Classes, refer to the
+ search package javadocs
Chances are DefaultSimilarity is sufficient for all your searching needs.
- However, in some applications it may be necessary to customize your Similarity implementation. For instance, some applications do not need to
- distinguish between shorter and longer documents (see a "fair" similarity). To change Similarity, one must do so for both indexing and searching, and the changes must happen before
- either of these actions takes place. Although in theory there is nothing stopping you from changing mid-stream, it just isn't well-defined what is going to happen.
- To make this change, implement your own Similarity (likely you'll want to simply subclass
- DefaultSimilarity) and then use the new
- class by calling
- IndexWriter.setSimilarity before indexing and
- Searcher.setSimilarity before searching.
-
- If you are interested in use cases for changing your similarity, see the Lucene users' mailing list at Overriding Similarity.
- In summary, here are a few use cases:
- SweetSpotSimilarity -- SweetSpotSimilarity gives small increases as the frequency increases a small amount
- and then greater increases when you hit the "sweet spot", i.e. where you think the frequency of terms is more significant. Overriding tf -- In some applications, it doesn't matter what the score of a document is as long as a matching term occurs. In these
- cases people have overridden Similarity to return 1 from the tf() method. Changing Length Normalization -- By overriding lengthNorm, it is possible to discount how the length of a field contributes
- to a score. In DefaultSimilarity, lengthNorm = 1 / (numTerms in field)^0.5, but if one changes this to be
- 1 / (numTerms in field), all fields will be treated
- "fairly". One of the ways of changing the scoring characteristics of Lucene is to change the similarity factors. For information on
+ how to do this, see the
+ search package javadocs Changing scoring is an expert level task, so tread carefully and be prepared to share your code if
- you want help.
- With the warning out of the way, it is possible to change a lot more than just the Similarity
- when it comes to scoring in Lucene. Lucene's scoring is a complex mechanism that is grounded by
- three main classes:
- At a much deeper level, one can affect scoring by implementing their own Query classes (and related scoring classes.) To learn more
+ about how to do this, refer to the
+ search package javadocs
In some sense, the
- Query
- class is where it all begins. Without a Query, there would be
- nothing to score. Furthermore, the Query class is the catalyst for the other scoring classes as it
- is often responsible
- for creating them or coordinating the functionality between them. The
- Query class has several methods that are important for
- derived classes:
- The
- Weight
- interface provides an internal representation of the Query so that it can be reused. Any
- Searcher
- dependent state should be stored in the Weight implementation,
- not in the Query class. The interface defines 6 methods that must be implemented:
- The
- Scorer
- abstract class provides common scoring functionality for all Scorer implementations and
- is the heart of the Lucene scoring process. The Scorer defines the following abstract methods which
- must be implemented:
- In a nutshell, you want to add your own custom Query implementation when you think that Lucene's
- existing queries aren't appropriate for the
- task that you want to do. You might be doing some cutting-edge research or you need more information
- back
- out of Lucene (similar to Doug adding SpanQuery functionality). FILL IN HERE
+
@@ -469,36 +353,9 @@
-
- TermQuery
-
-
- TermQuery tq = new TermQuery(new Term("fieldName", "term"));
-
In this example, the Query identifies all Documents that have the Field named "fieldName" and
- contain the word "term".
-
-
- BooleanQuery
-
-
-
-
- Boolean queries are constructed by adding two or more
- BooleanClause
- instances. If too many clauses are added, a TooManyClauses
- exception will be thrown during searching. This most often occurs
- when a Query
- is rewritten into a BooleanQuery with many
- TermQuery clauses,
- for example by WildcardQuery.
- The default setting for the maximum number
- of clauses is 1024, but this can be changed via the
- static method setMaxClauseCount
- in BooleanQuery.
-
- Phrases
-
-
-
-
- RangeQuery
-
-
- PrefixQuery,
- WildcardQuery
-
-
- FuzzyQuery
-
-
-
-
- In general, Chris Hostetter sums it up best in saying (from the Lucene users' mailing list):
- [One would override the Similarity in] ... any situation where you know more about your data than just that
- it's "text" is a situation where it *might* make sense to override your
- Similarity method.
-
+
@@ -516,169 +373,10 @@
Modified: lucene/java/trunk/docs/systemproperties.html
URL: http://svn.apache.org/viewvc/lucene/java/trunk/docs/systemproperties.html?view=diff&rev=442406&r1=442405&r2=442406
==============================================================================
--- lucene/java/trunk/docs/systemproperties.html (original)
+++ lucene/java/trunk/docs/systemproperties.html Mon Sep 11 18:05:20 2006
@@ -86,6 +86,8 @@
-
-
- Details on each of these classes, and their children can be found in the subsections below.
+
-
-
-
-
- The Query Class
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- The Weight Interface
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- The Scorer Class
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- Why would I want to add my own Query?
-
-
-
-
-
-
-
-
-
-
- Examples
-
-
-
-
-
-
- Table Of Contents
+
Search over indices. Applications usually call {@link org.apache.lucene.search.Searcher#search(Query)} or {@link org.apache.lucene.search.Searcher#search(Query,Filter)}. + + +
+ +Of the various implementations of + Query, the + TermQuery + is the easiest to understand and the most often used in applications. A TermQuery matches all the documents that contain the + specified + Term, + which is a word that occurs in a certain + Field. + Thus, a TermQuery identifies and scores all + Documents that have a Field with the specified string in it. + Constructing a TermQuery + is as simple as: +
+ TermQuery tq = new TermQuery(new Term("fieldName", "term")); +In this example, the Query identifies all Documents that have the Field named "fieldName" and + contain the word "term". + +
Things start to get interesting when one combines multiple + TermQuery instances into a BooleanQuery. + A BooleanQuery contains multiple + BooleanClauses, + where each clause contains a sub-query (Query + instance) and an operator (from BooleanClause.Occur) + describing how that sub-query is combined with the other clauses: +
SHOULD -- Use this operator when a clause can occur in the result set, but is not required. + If a query is made up of all SHOULD clauses, then every document in the result + set matches at least one of these clauses.
MUST -- Use this operator when a clause is required to occur in the result set. Every + document in the result set will match + all such clauses.
MUST NOT -- Use this operator when a + clause must not occur in the result set. No + document in the result set will match + any such clauses.
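The three operators above can be illustrated without Lucene at all. The following is a minimal sketch, not Lucene's implementation: `BooleanClauseSketch`, `Clause`, and `matches` are hypothetical names, a document is reduced to a set of terms, and scoring is ignored. It also assumes that a query made up only of MUST NOT clauses matches nothing, which the text above does not state.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of SHOULD / MUST / MUST_NOT matching semantics.
// This is NOT Lucene code; it only mirrors the rules described in the text.
final class BooleanClauseSketch {
    enum Occur { SHOULD, MUST, MUST_NOT }

    static final class Clause {
        final String term;
        final Occur occur;

        Clause(String term, Occur occur) {
            this.term = term;
            this.occur = occur;
        }
    }

    // Returns true if a document (modelled as a set of terms) satisfies all clauses.
    static boolean matches(Set<String> docTerms, List<Clause> clauses) {
        boolean hasMust = false;
        boolean anyShouldHit = false;
        for (Clause c : clauses) {
            boolean hit = docTerms.contains(c.term);
            switch (c.occur) {
                case MUST:
                    hasMust = true;
                    if (!hit) return false;   // every MUST clause is required
                    break;
                case MUST_NOT:
                    if (hit) return false;    // no MUST_NOT clause may match
                    break;
                default:                      // SHOULD
                    anyShouldHit |= hit;
                    break;
            }
        }
        // Without any MUST clause, at least one SHOULD clause has to match.
        // Assumption: a query with only MUST_NOT clauses matches nothing.
        return hasMust || anyShouldHit;
    }

    public static void main(String[] args) {
        Set<String> doc = new HashSet<>(Arrays.asList("lucene", "search"));
        List<Clause> query = Arrays.asList(
                new Clause("lucene", Occur.MUST),
                new Clause("solr", Occur.SHOULD),
                new Clause("database", Occur.MUST_NOT));
        System.out.println(matches(doc, query));
    }
}
```

In this sketch the SHOULD clause on "solr" does not prevent the match, because the MUST clause is satisfied and no MUST_NOT term is present.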
Another common search is to find documents containing certain phrases. This + is handled in two different ways. +
PhraseQuery + -- Matches a sequence of + Terms. + PhraseQuery uses a slop factor to determine + how many positions may occur between any two terms in the phrase and still be considered a match.
+SpanNearQuery + -- Matches a sequence of other + SpanQuery + instances. SpanNearQuery allows for + much more + complicated phrase queries since it is constructed from other SpanQuery + instances, instead of only TermQuery + instances.
+The + RangeQuery + matches all documents that occur in the + exclusive range of a lower + Term + and an upper + Term. + For example, one could find all documents + that have terms beginning with the letters a through c. This type of Query is frequently used to + find + documents that occur in a specific date range. +
+While the + PrefixQuery + has a different implementation, it is essentially a special case of the + WildcardQuery. + The PrefixQuery allows an application + to identify all documents with terms that begin with a certain string. The WildcardQuery generalizes this by allowing + for the use of * (matches 0 or more characters) and ? (matches exactly one character) wildcards. + Note that the WildcardQuery can be quite slow. Also + note that + WildcardQuery should + not start with * or ?, as such queries are extremely slow. For tricks on how to search using a wildcard + at + the beginning of a term, see + + Starts With x and Ends With x Queries + from the Lucene users' mailing list. +
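To make the wildcard rules concrete, here is a minimal, Lucene-independent sketch of matching a single term against a pattern in which * stands for zero or more characters and ? for exactly one. `WildcardSketch` and `matches` are hypothetical names, and this is not how WildcardQuery is implemented; it only shows the matching rule for one term. A plausible reading of the performance warning is that a pattern beginning with a wildcard has no fixed prefix to narrow the set of candidate terms, so a test like this would have to run against every term in the field.

```java
// Hypothetical illustration of '*' and '?' wildcard matching on a single term.
// This is NOT Lucene's WildcardQuery; it only demonstrates the matching rule.
final class WildcardSketch {

    // Returns true if the whole term matches the pattern.
    static boolean matches(String pattern, String term) {
        int p = 0;          // position in pattern
        int t = 0;          // position in term
        int star = -1;      // index of the most recent '*' in the pattern, if any
        int mark = 0;       // term position where that '*' started matching
        while (t < term.length()) {
            if (p < pattern.length() && pattern.charAt(p) == '*') {
                star = p++;             // remember the star, try matching zero chars first
                mark = t;
            } else if (p < pattern.length()
                    && (pattern.charAt(p) == '?' || pattern.charAt(p) == term.charAt(t))) {
                p++;                    // single character matched
                t++;
            } else if (star >= 0) {
                p = star + 1;           // backtrack: let the last '*' absorb one more char
                t = ++mark;
            } else {
                return false;
            }
        }
        // Any remaining pattern characters must all be '*'.
        while (p < pattern.length() && pattern.charAt(p) == '*') {
            p++;
        }
        return p == pattern.length();
    }

    public static void main(String[] args) {
        System.out.println(matches("te?t", "text"));
        System.out.println(matches("luc*", "lucene"));
    }
}
```

A pattern such as "luc*" behaves like the prefix case described above, while "te?t" matches "test" and "text" but not "tet".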
+A + FuzzyQuery + matches documents that contain terms similar to the specified term. Similarity is + determined using + Levenshtein (edit) distance. + This type of query can be useful when accounting for spelling variations in the collection. +
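For readers unfamiliar with Levenshtein distance, the following is a minimal sketch of the standard dynamic-programming computation: the number of single-character insertions, deletions, and substitutions needed to turn one string into another. `EditDistanceSketch` and `distance` are hypothetical names; how FuzzyQuery turns this distance into a similarity threshold is not shown here.

```java
// Hypothetical illustration of Levenshtein (edit) distance.
// This is the textbook algorithm, not code taken from Lucene's FuzzyQuery.
final class EditDistanceSketch {

    // Returns the minimum number of single-character edits turning a into b.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];   // distances for the previous row
        int[] curr = new int[b.length() + 1];   // distances for the current row
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;                        // empty prefix of a: j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;                        // empty prefix of b: i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1,   // insertion
                                 prev[j] + 1),      // deletion
                        prev[j - 1] + cost);        // substitution or match
            }
            int[] tmp = prev;                   // reuse the two rows
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting")); // prints 3
    }
}
```

A spelling variant such as "lucine" is one substitution away from "lucene", which is the kind of near match the paragraph above describes.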
+ +Chances are DefaultSimilarity is sufficient for all + your searching needs. + However, in some applications it may be necessary to customize your Similarity implementation. For instance, some + applications do not need to + distinguish between shorter and longer documents (see a "fair" similarity).
+ +To change Similarity, one must do so for both indexing and + searching, and the changes must happen before + either of these actions takes place. Although in theory there is nothing stopping you from changing mid-stream, it + just isn't well-defined what is going to happen. +
+ +To make this change, implement your own Similarity (likely + you'll want to simply subclass + DefaultSimilarity) and then use the new + class by calling + IndexWriter.setSimilarity + before indexing and + Searcher.setSimilarity + before searching. +
+ ++ If you are interested in use cases for changing your similarity, see the Lucene users' mailing list at Overriding Similarity. + In summary, here are a few use cases: +
SweetSpotSimilarity -- SweetSpotSimilarity gives small increases + as the frequency increases a small amount + and then greater increases when you hit the "sweet spot", i.e. where you think the frequency of terms is + more significant.
Overriding tf -- In some applications, it doesn't matter what the score of a document is as long as a + matching term occurs. In these + cases people have overridden Similarity to return 1 from the tf() method.
Changing Length Normalization -- By overriding lengthNorm, + it is possible to discount how the length of a field contributes + to a score. In DefaultSimilarity, + lengthNorm = 1 / (numTerms in field)^0.5, but if one changes this to be + 1 / (numTerms in field), all fields will be treated + "fairly".
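The effect of the two length-normalization formulas quoted above is easier to see with numbers. The sketch below is plain arithmetic, not a Similarity subclass: `LengthNormSketch` and its method names are hypothetical, and it ignores everything else that contributes to a Lucene score. It also includes a flat tf in the spirit of the "Overriding tf" use case, returning 0 for an absent term as an assumption of this sketch.

```java
// Hypothetical illustration of the formulas mentioned in the text.
// This does NOT extend Lucene's Similarity; it only evaluates the arithmetic.
final class LengthNormSketch {

    // lengthNorm = 1 / (numTerms in field)^0.5, as quoted for DefaultSimilarity.
    static double sqrtLengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // lengthNorm = 1 / (numTerms in field), the alternative quoted in the text.
    static double linearLengthNorm(int numTerms) {
        return 1.0 / numTerms;
    }

    // A flat term-frequency factor: any occurrence counts the same.
    // Assumption: a term that does not occur contributes 0.
    static double flatTf(int freq) {
        return freq > 0 ? 1.0 : 0.0;
    }

    public static void main(String[] args) {
        int[] lengths = {4, 100, 10000};
        for (int n : lengths) {
            System.out.println(n + " terms: sqrt=" + sqrtLengthNorm(n)
                    + " linear=" + linearLengthNorm(n));
        }
    }
}
```

Going from 4 to 100 terms reduces the square-root norm by a factor of 5 but the linear norm by a factor of 25, so the linear form penalizes long fields much more strongly.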
[One would override the Similarity in] ... any situation where you know more about your data than just + that + it's "text" is a situation where it *might* make sense to override your + Similarity method.+ + +
Changing scoring is an expert level task, so tread carefully and be prepared to share your code if + you want help. +
+ +With the warning out of the way, it is possible to change a lot more than just the Similarity + when it comes to scoring in Lucene. Lucene's scoring is a complex mechanism that is grounded by + three main classes: +
In some sense, the + Query + class is where it all begins. Without a Query, there would be + nothing to score. Furthermore, the Query class is the catalyst for the other scoring classes as it + is often responsible + for creating them or coordinating the functionality between them. The + Query class has several methods that are important for + derived classes: +
The + Weight + interface provides an internal representation of the Query so that it can be reused. Any + Searcher + dependent state should be stored in the Weight implementation, + not in the Query class. The interface defines 6 methods that must be implemented: +
The + Scorer + abstract class provides common scoring functionality for all Scorer implementations and + is the heart of the Lucene scoring process. The Scorer defines the following abstract methods which + must be implemented: +
In a nutshell, you want to add your own custom Query implementation when you think that Lucene's + existing queries aren't appropriate for the + task that you want to do. You might be doing some cutting-edge research or you need more information + back + out of Lucene (similar to Doug adding SpanQuery functionality).
+FILL IN HERE