From: "Hoss Man (JIRA)"
To: java-dev@lucene.apache.org
Date: Tue, 10 Apr 2007 17:29:32 -0700 (PDT)
Subject: [jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

    [ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487962 ]

Hoss Man commented on LUCENE-855:
---------------------------------

On Mon, 9 Apr 2007, Otis
Gospodnetic (JIRA) wrote:
: I'd love to know what Hoss and other big Filter users think about this.
: Solr makes a lot of use of (Range?)Filters, I believe.

This is one of those Jira issues that I didn't really have time to follow when it was first opened, and so the Jira emails have just been piling up waiting for me to read. Here are the raw notes I took as I read through the patches...

----------------

FieldCacheRangeFilter.patch from 10/Apr/07 01:52 PM

* javadoc cut/paste errors (FieldCache)
* FieldCacheRangeFilter should work with simple strings (using FieldCache.getStrings or FieldCache.getStringIndex) just like regular RangeFilter
* it feels like the various parser versions should be in separate subclasses (common abstract base class?)
* why does clone need to construct a raw BitSet? what exactly didn't work about ChainedFilter without this? (could cause other BitSet usage problems)
* or/and/andNot/xor can all be implemented using convertToBitSet
* need FieldCacheBitSet methods: cardinality, get(int,int)
* need equals and hashCode methods in all new classes
* FieldCacheBitSet.clear should throw UnsupportedOperationException
* convertToBitSet can be cached
* FieldCacheBitSet should be abstract, requiring get(int) to be implemented

MemoryCachedRangeFilter_1.4.patch from 06/Apr/07 06:14 AM

* "tuples" should be initialized to fieldCache.length ... serious ArrayList resizing going on there (why is it an ArrayList, why not just Tuple[]?)
* doesn't "cache" need synchronization? ... it seems like the same CreationPlaceholder pattern used in FieldCache might make sense here
* this looks wrong...
      } else if ( (!includeLower) && (lowerIndex >= 0) ) {
  ...consider the case where lower==5, includeLower==false, and all values in the index are 5: the binary search could leave us in the middle of the index, so we still need to move forward to the end?
* ditto the above concern for finding upperIndex
* what is the pathological worst case for rewind/forward when *lots* of duplicate values are in the index?
should another binarySearch be used?
* a lot of the code in MemoryCachedRangeFilter.bits for finding lowerIndex/upperIndex would probably make more sense as methods in SortedFieldCache
* only seems to handle longs; at a minimum it should deal with arbitrary strings, with optional add-ons for longs/ints/etc.
* I can't help but wonder how MemoryCachedRangeFilter would compare if it used Solr's OpenBitSet (facaded to implement the BitSet API)

TestRangeFilterPerformanceComparison.java from 10/Apr/07

* I can't help but wonder how RangeFilter would compare if it used Solr's OpenBitSet (facaded to implement the BitSet API)
* no test of includeLower==false or includeUpper==false
* I don't think the ranges being compared are the same for RangeFilter as they are for the other Filters ... note the use of DateTools when building the index, vs. straight string usage in RangeFilter, vs. Long.parseLong in MemoryCachedRangeFilter and FieldCacheRangeFilter
* is it really a fair comparison to call MemoryCachedRangeFilter.warmup or FieldCacheRangeFilter.bits outside of the timing code? for indexes where the IndexReader is reopened periodically this may be a significant number to be aware of.

----------------

Questions about the legitimacy of the testing aside...

In general, I like the approach of FieldCacheBitSet -- but it should be generalized into an "AbstractReadOnlyBitSet" where all methods are implemented via get(int) in subclasses -- we should make sure that every method in the BitSet API works as advertised in Java 1.4.

I don't really like the various hoops FieldCacheRangeFilter has to jump through to support int/float/long ... I think at its core it should support simple Strings, with alternate/sub classes for dealing with other FieldCache formats ... I just really dislike all the crazy nested ifs dealing with the different Parser types; if there are going to be separate constructors for longs/floats/ints, they might as well be separate sub-classes.
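For the record, the "AbstractReadOnlyBitSet" idea could be sketched roughly as below. This is a hypothetical illustration, not code from any of the attached patches; the class and method names (AbstractReadOnlyBitSet, size0, RangeBitSet) are made up, and a real version would have to override every query method BitSet exposes (and, or, length, etc.), since BitSet's own implementations consult its internal word array rather than get(int):

```java
import java.util.BitSet;

// Hypothetical sketch: a read-only BitSet facade where every query
// method is derived from a single abstract get(int), and all mutators
// throw UnsupportedOperationException.
abstract class AbstractReadOnlyBitSet extends BitSet {

    // the one method subclasses must implement
    public abstract boolean get(int index);

    // number of bits this set logically covers (name is made up)
    protected abstract int size0();

    public int cardinality() {
        int count = 0;
        for (int i = 0; i < size0(); i++) {
            if (get(i)) count++;
        }
        return count;
    }

    public int nextSetBit(int fromIndex) {
        for (int i = fromIndex; i < size0(); i++) {
            if (get(i)) return i;
        }
        return -1;
    }

    public BitSet get(int from, int to) {
        // materialize a plain BitSet for the requested window
        BitSet result = new BitSet(to - from);
        for (int i = from; i < to; i++) {
            if (get(i)) result.set(i - from);
        }
        return result;
    }

    // mutators are unsupported on a read-only view
    public void set(int index)   { throw new UnsupportedOperationException(); }
    public void clear(int index) { throw new UnsupportedOperationException(); }
    public void clear()          { throw new UnsupportedOperationException(); }

    // Example subclass: bit i is "set" iff values[i] falls in [lower, upper],
    // computed on the fly from a cached value array -- no BitSet is built.
    static class RangeBitSet extends AbstractReadOnlyBitSet {
        private final long[] values;
        private final long lower, upper;

        RangeBitSet(long[] values, long lower, long upper) {
            this.values = values;
            this.lower = lower;
            this.upper = upper;
        }

        public boolean get(int i) {
            return values[i] >= lower && values[i] <= upper;
        }

        protected int size0() { return values.length; }
    }
}
```

The point of the facade is that bits() never has to materialize anything: membership is answered per-document against the FieldCache values, and a concrete BitSet only gets built if a caller (e.g. ChainedFilter) genuinely needs one.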
The really nice thing this has over RangeFilter is that people can index raw numeric values without needing to massage them into lexicographically ordered Strings (since the FieldCache will take care of parsing them appropriately).

My gut tells me that the MemoryCachedRangeFilter approach will never ever be able to compete with the FieldCacheRangeFilter facading-BitSet approach, since it needs to build the FieldCache, then the SortedFieldCache, then a BitSet ... it seems like any optimization of that pipeline can always be beaten by using the same logic, but then facading the BitSet.

> MemoryCachedRangeFilter to boost performance of Range queries
> -------------------------------------------------------------
>
>                 Key: LUCENE-855
>                 URL: https://issues.apache.org/jira/browse/LUCENE-855
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.1
>            Reporter: Andy Liu
>         Assigned To: Otis Gospodnetic
>         Attachments: FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, MemoryCachedRangeFilter.patch, MemoryCachedRangeFilter_1.4.patch, TestRangeFilterPerformanceComparison.java, TestRangeFilterPerformanceComparison.java
>
>
> Currently RangeFilter uses TermEnum and TermDocs to find documents that fall within the specified range. This requires iterating through every single term in the index and can get rather slow for large document sets.
> MemoryCachedRangeFilter reads all docId/value pairs of a given field, sorts them by value, and stores them in a SortedFieldCache. During bits(), binary searches are used to find the start and end indices of the lower and upper bound values. The BitSet is populated with all the docId values that fall between the start and end indices.
> TestMemoryCachedRangeFilterPerformance creates a 100K RAMDirectory-backed index with random date values within a 5 year range.
> Executing bits() 1000 times on a standard RangeQuery using random date intervals took 63904ms. Using MemoryCachedRangeFilter, it took 876ms. The performance increase is less dramatic when a field has fewer unique terms or when the index has fewer documents.
> Currently MemoryCachedRangeFilter only works with numeric values (values are stored in a long[] array), but it can easily be changed to support Strings. A side "benefit" of the values being stored as longs is that there's no longer any need to make the values lexicographically comparable, i.e. padding numeric values with zeros.
> The downside of using MemoryCachedRangeFilter is that there's a fairly significant memory requirement, so it's designed to be used in situations where range filter performance is critical and memory consumption is not an issue. The memory requirement is: (sizeof(int) + sizeof(long)) * numDocs.
> MemoryCachedRangeFilter also requires a warmup step, which can take a while to run on large datasets (it took 40s to run on a 3M document corpus). Warmup can be called explicitly, or it is called automatically the first time MemoryCachedRangeFilter is applied to a given field.
> So in summary, MemoryCachedRangeFilter can be useful when:
> - Performance is critical
> - Memory is not an issue
> - The field contains many unique numeric values
> - The index contains a large number of documents

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.