From: "Hoss Man (JIRA)"
To: java-dev@lucene.apache.org
Date: Tue, 10 Apr 2007 17:29:32 -0700 (PDT)
Subject: [jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

    [ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487962 ]

Hoss Man commented on LUCENE-855:
---------------------------------

On Mon, 9 Apr 2007, Otis
Gospodnetic (JIRA) wrote:
: I'd love to know what Hoss and other big Filter users think about this.
: Solr makes a lot of use of (Range?)Filters, I believe.

This is one of those Jira issues that I didn't really have time to follow when it was first opened, and so the Jira emails have just been piling up waiting for me to read. Here are the raw notes I took as I read through the patches...

----------------

FieldCacheRangeFilter.patch from 10/Apr/07 01:52 PM

* javadoc cut/paste errors (FieldCache)
* FieldCacheRangeFilter should work with simple strings (using FieldCache.getStrings or FieldCache.getStringIndex) just like regular RangeFilter
* it feels like the various parser versions should be in separate subclasses (common abstract base class?)
* why does clone need to construct a raw BitSet? what exactly didn't work about ChainedFilter without this? (could cause other BitSet usage problems)
* or/and/andNot/xor can all be implemented using convertToBitSet
* need FieldCacheBitSet methods: cardinality, get(int,int)
* need equals and hashCode methods in all new classes
* FieldCacheBitSet.clear should throw UnsupportedOperationException
* convertToBitSet can be cached
* FieldCacheBitSet should be abstract, requiring get(int) to be implemented

MemoryCachedRangeFilter_1.4.patch from 06/Apr/07 06:14 AM

* "tuples" should be initialized to fieldCache.length ... serious ArrayList resizing going on there (why is it an ArrayList, why not just Tuple[]?)
* doesn't "cache" need synchronization? ... it seems like the same CreationPlaceholder pattern used in FieldCache might make sense here
* this looks wrong...
      } else if ( (!includeLower) && (lowerIndex >= 0) ) {
  ...consider the case where lower==5, includeLower==false, and all values in the index are 5: the binary search could leave us in the middle of the index, so we still need to move forward to the end?
* ditto the above concern for finding upperIndex
* what is the pathological worst case for rewind/forward when *lots* of duplicate values are in the index?
should another binarySearch be used?
* a lot of the code in MemoryCachedRangeFilter.bits for finding lowerIndex/upperIndex would probably make more sense as methods in SortedFieldCache
* only seems to handle longs; at a minimum it should deal with arbitrary strings, with optional add-ons for longs/ints/etc.
* I can't help but wonder how MemoryCachedRangeFilter would compare if it used Solr's OpenBitSet (facaded to implement the BitSet API)

TestRangeFilterPerformanceComparison.java from 10/Apr/07

* I can't help but wonder how RangeFilter would compare if it used Solr's OpenBitSet (facaded to implement the BitSet API)
* no test of includeLower==false or includeUpper==false
* I don't think the ranges being compared are the same for RangeFilter as they are for the other Filters ... note the use of DateTools when building the index, vs. straight string usage in RangeFilter, vs. Long.parseLong in MemoryCachedRangeFilter and FieldCacheRangeFilter
* is it really a fair comparison to call MemoryCachedRangeFilter.warmup or FieldCacheRangeFilter.bits outside of the timing code? for indexes where the IndexReader is reopened periodically this may be a significant number to be aware of.

----------------

Questions about the legitimacy of the testing aside...

In general, I like the approach of FieldCacheBitSet -- but it should be generalized into an "AbstractReadOnlyBitSet" where all methods are implemented via get(int) in subclasses -- we should make sure that every method in the BitSet API works as advertised in Java 1.4.

I don't really like the various hoops FieldCacheRangeFilter has to jump through to support int/float/long ... I think at its core it should support simple Strings, with alternate/sub classes for dealing with other FieldCache formats ... I just really dislike all the crazy nested ifs dealing with the different Parser types; if there are going to be separate constructors for longs/floats/ints, they might as well be separate sub-classes.
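For the record, the "AbstractReadOnlyBitSet" idea could be sketched roughly as below. This is a hypothetical illustration, not code from any of the attached patches; the class and method names (AbstractReadOnlyBitSet, size0, RangeBitSet) are made up, and a real version would have to override every query method BitSet exposes (and, or, length, etc.), since BitSet's own implementations consult its internal word array rather than get(int):

```java
import java.util.BitSet;

// Hypothetical sketch: a read-only BitSet facade where every query
// method is derived from a single abstract get(int), and all mutators
// throw UnsupportedOperationException.
abstract class AbstractReadOnlyBitSet extends BitSet {

    // the one method subclasses must implement
    public abstract boolean get(int index);

    // number of bits this set logically covers (name is made up)
    protected abstract int size0();

    public int cardinality() {
        int count = 0;
        for (int i = 0; i < size0(); i++) {
            if (get(i)) count++;
        }
        return count;
    }

    public int nextSetBit(int fromIndex) {
        for (int i = fromIndex; i < size0(); i++) {
            if (get(i)) return i;
        }
        return -1;
    }

    public BitSet get(int from, int to) {
        // materialize a plain BitSet for the requested window
        BitSet result = new BitSet(to - from);
        for (int i = from; i < to; i++) {
            if (get(i)) result.set(i - from);
        }
        return result;
    }

    // mutators are unsupported on a read-only view
    public void set(int index)   { throw new UnsupportedOperationException(); }
    public void clear(int index) { throw new UnsupportedOperationException(); }
    public void clear()          { throw new UnsupportedOperationException(); }

    // Example subclass: bit i is "set" iff values[i] falls in [lower, upper],
    // computed on the fly from a cached value array -- no BitSet is built.
    static class RangeBitSet extends AbstractReadOnlyBitSet {
        private final long[] values;
        private final long lower, upper;

        RangeBitSet(long[] values, long lower, long upper) {
            this.values = values;
            this.lower = lower;
            this.upper = upper;
        }

        public boolean get(int i) {
            return values[i] >= lower && values[i] <= upper;
        }

        protected int size0() { return values.length; }
    }
}
```

The point of the facade is that bits() never has to materialize anything: membership is answered per-document against the FieldCache values, and a concrete BitSet only gets built if a caller (e.g. ChainedFilter) genuinely needs one.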
The really nice thing this has over RangeFilter is that people can index raw numeric values without needing to massage them into lexicographically ordered Strings (since the FieldCache will take care of parsing them appropriately).

My gut tells me that the MemoryCachedRangeFilter approach will never ever be able to compete with the FieldCacheRangeFilter facading-BitSet approach, since it needs to build the FieldCache, then the SortedFieldCache, then a BitSet ... it seems like any optimization of that pipeline can always be beaten by using the same logic, but then facading the BitSet.

> MemoryCachedRangeFilter to boost performance of Range queries
> -------------------------------------------------------------
>
>                 Key: LUCENE-855
>                 URL: https://issues.apache.org/jira/browse/LUCENE-855
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.1
>            Reporter: Andy Liu
>         Assigned To: Otis Gospodnetic
>         Attachments: FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, MemoryCachedRangeFilter.patch, MemoryCachedRangeFilter_1.4.patch, TestRangeFilterPerformanceComparison.java, TestRangeFilterPerformanceComparison.java
>
>
> Currently RangeFilter uses TermEnum and TermDocs to find documents that fall within the specified range. This requires iterating through every single term in the index and can get rather slow for large document sets.
> MemoryCachedRangeFilter reads all docId/value pairs of a given field, sorts them by value, and stores them in a SortedFieldCache. During bits(), binary searches are used to find the start and end indices of the lower and upper bound values. The BitSet is populated with all the docId values that fall between the start and end indices.
> TestMemoryCachedRangeFilterPerformance creates a 100K RAMDirectory-backed index with random date values within a 5 year range.
> Executing bits() 1000 times on a standard RangeQuery using random date intervals took 63904ms. Using MemoryCachedRangeFilter, it took 876ms. The performance increase is less dramatic when a field has fewer unique terms or when the index has fewer documents.
> Currently MemoryCachedRangeFilter only works with numeric values (values are stored in a long[] array), but it can easily be changed to support Strings. A side "benefit" of the values being stored as longs is that there's no longer any need to make the values lexicographically comparable, i.e. padding numeric values with zeros.
> The downside of using MemoryCachedRangeFilter is that there's a fairly significant memory requirement, so it's designed to be used in situations where range filter performance is critical and memory consumption is not an issue. The memory requirement is: (sizeof(int) + sizeof(long)) * numDocs.
> MemoryCachedRangeFilter also requires a warmup step, which can take a while to run on large datasets (it took 40s to run on a 3M document corpus). Warmup can be called explicitly, or it is called automatically the first time MemoryCachedRangeFilter is applied to a given field.
> So in summary, MemoryCachedRangeFilter can be useful when:
> - Performance is critical
> - Memory is not an issue
> - The field contains many unique numeric values
> - The index contains a large number of documents

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.