lucene-java-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From mikemcc...@apache.org
Subject svn commit: r738862 - /lucene/java/trunk/src/java/org/apache/lucene/search/FieldCacheTermsFilter.java
Date Thu, 29 Jan 2009 14:13:35 GMT
Author: mikemccand
Date: Thu Jan 29 14:13:35 2009
New Revision: 738862

URL: http://svn.apache.org/viewvc?rev=738862&view=rev
Log:
LUCENE-1487: improve javadoc for FieldCacheTermsFilter

Modified:
    lucene/java/trunk/src/java/org/apache/lucene/search/FieldCacheTermsFilter.java

Modified: lucene/java/trunk/src/java/org/apache/lucene/search/FieldCacheTermsFilter.java
URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/java/org/apache/lucene/search/FieldCacheTermsFilter.java?rev=738862&r1=738861&r2=738862&view=diff
==============================================================================
--- lucene/java/trunk/src/java/org/apache/lucene/search/FieldCacheTermsFilter.java (original)
+++ lucene/java/trunk/src/java/org/apache/lucene/search/FieldCacheTermsFilter.java Thu Jan
29 14:13:35 2009
@@ -18,31 +18,82 @@
  */
 
 import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.index.TermDocs;          // for javadoc
 import org.apache.lucene.util.OpenBitSet;
 
 import java.io.IOException;
 import java.util.Iterator;
 
 /**
- * A term filter built on top of a cached single field (in FieldCache). It can be used only
- * with single-valued fields.
+ * A {@link Filter} that only accepts documents whose single
+ * term value in the specified field is contained in the
+ * provided set of allowed terms.
+ * 
  * <p/>
- * FieldCacheTermsFilter builds a single cache for the field the first time it is used. Each
- * subsequent FieldCacheTermsFilter on the same field then re-uses this cache even if the
terms
- * themselves are different.
- * <p/>
- * The FieldCacheTermsFilter is faster than building a TermsFilter each time.
- * FieldCacheTermsFilter are fast to build in cases where number of documents are far more
than
- * unique terms. Internally, it creates a BitSet by term number and scans by document id.
- * <p/>
- * As with all FieldCache based functionality, FieldCacheTermsFilter is only valid for fields
- * which contain zero or one terms for each document. Thus it works on dates, prices and
other
- * single value fields but will not work on regular text fields. It is preferable to use
an
- * NOT_ANALYZED field to ensure that there is only a single term.
+ * 
+ * This is the same functionality as TermsFilter (from
+ * contrib/queries), except this filter requires that the
+ * field contains only a single term for all documents.
+ * Because of drastically different implementations, they
+ * also have different performance characteristics, as
+ * described below.
+ * 
  * <p/>
- * Also, collation is performed at the time the FieldCache is built; to change collation
you
- * need to override the getFieldCache() method to change the underlying cache.
+ * 
+ * The first invocation of this filter on a given field will
+ * be slower, since a {@link FieldCache.StringIndex} must be
+ * created.  Subsequent invocations using the same field
+ * will re-use this cache.  However, as with all
+ * functionality based on {@link FieldCache}, persistent RAM
+ * is consumed to hold the cache, and is not freed until the
+ * {@link IndexReader} is closed.  In contrast, TermsFilter
+ * has no persistent RAM consumption.
+ * 
+ * 
+ * <p/>
+ * 
+ * With each search, this filter translates the specified
+ * set of Terms into a private {@link OpenBitSet} keyed by
+ * term number per unique {@link IndexReader} (normally one
+ * reader per segment).  Then, during matching, the term
+ * number for each docID is retrieved from the cache and
+ * then checked for inclusion using the {@link OpenBitSet}.
+ * Since all testing is done using RAM resident data
+ * structures, performance should be very fast, most likely
+ * fast enough to not require further caching of the
+ * DocIdSet for each possible combination of terms.
+ * However, because docIDs are simply scanned linearly, an
+ * index with a great many small documents may find this
+ * linear scan too costly.
+ * 
+ * <p/>
+ * 
+ * In contrast, TermsFilter builds up an {@link OpenBitSet},
+ * keyed by docID, every time it's created, by enumerating
+ * through all matching docs using {@link TermDocs} to seek
+ * and scan through each term's docID list.  While there is
+ * no linear scan of all docIDs, besides the allocation of
+ * the underlying array in the {@link OpenBitSet}, this
+ * approach requires a number of "disk seeks" in proportion
+ * to the number of terms, which can be exceptionally costly
+ * when there are cache misses in the OS's IO cache.
+ * 
+ * <p/>
+ * 
+ * Generally, this filter will be slower on the first
+ * invocation for a given field, but subsequent invocations,
+ * even if you change the allowed set of Terms, should be
+ * faster than TermsFilter, especially as the number of
+ * Terms being matched increases.  If you are matching only
+ * a very small number of terms, and those terms in turn
+ * match a very small number of documents, TermsFilter may
+ * perform faster.
+ *
+ * <p/>
+ *
+ * Which filter is best is very application dependent.
  */
+
 public class FieldCacheTermsFilter extends Filter {
   private String field;
   private Iterable terms;



Mime
View raw message