Date: Fri, 07 Nov 2008 12:26:22 -0800
Subject: Term numbering and range filtering
From: Tim Sturge
Reply-To: java-user@lucene.apache.org

Hi,

I'm wondering if there is any easy technique to number the terms in an index. (By "number" I mean map a sequence of terms to a contiguous range of integers, and map terms to these numbers efficiently.)

Looking at the Term class and the .tis/.tii index format, it appears that the terms are stored in an ordered and prefix-compressed format, but while there are pointers from a term to the .frq and .prx files, neither is really suitable as a sequence number.

The reason I have this question is that I am writing a multi-filter for single-term fields. My index contains many fields for which each document contains a single term (e.g. date, zipcode, country), and I need to perform range queries or set matches over these fields, many of which are very inclusive (they match >10% of the total documents).

A cached RangeFilter works well when there are a small number of potential options (e.g. for countries), but when there are many options (consider a date range or a set of zipcodes) there are too many potential choices to cache each possibility, and it is too inefficient to build a filter on the fly for each query (as you have to visit 10% of documents to build the filter despite the query itself matching 0.1%).

Therefore I was considering building an int[reader.maxDoc()] array for each field and putting into it the term number for each document.
This relies on the fact that each document contains only a single term for this field, but with it I should be able to quickly construct a "multi-filter" (that is, something that iterates the array and checks that the term is in the range or set).

Right now it looks like I can do some very ugly surgery and perhaps use the offset to the prx file even though it is not contiguous. But I'm hoping there is a better technique that I'm just not seeing right now.

Thanks,

Tim
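
To make the array idea above concrete, here is a minimal sketch in plain Java. It does not use any Lucene API: the class name TermOrdinalFilter, its constructor argument, and the range method are all hypothetical, and it assumes the single term of each document is already available as a String array indexed by document number. Term numbers are simply positions in the sorted list of distinct terms, so a range of terms becomes a range of integers.

```java
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical helper, not part of Lucene: numbers the terms of a
// single-valued field and filters documents by term-number range.
final class TermOrdinalFilter {
    // Distinct terms in sorted order; a term's index is its term number.
    private final String[] sortedTerms;
    // docToOrd[doc] is the term number of the single term in document doc.
    private final int[] docToOrd;

    // termPerDoc[doc] is the field's single term for document doc
    // (assumed non-null for every document).
    public TermOrdinalFilter(String[] termPerDoc) {
        sortedTerms = Arrays.stream(termPerDoc).distinct().sorted().toArray(String[]::new);
        docToOrd = new int[termPerDoc.length];
        for (int doc = 0; doc < termPerDoc.length; doc++) {
            docToOrd[doc] = Arrays.binarySearch(sortedTerms, termPerDoc[doc]);
        }
    }

    // Returns the documents whose term lies in [lower, upper], both inclusive.
    public BitSet range(String lower, String upper) {
        int lo = Arrays.binarySearch(sortedTerms, lower);
        if (lo < 0) lo = -lo - 1;   // number of the first term >= lower
        int hi = Arrays.binarySearch(sortedTerms, upper);
        if (hi < 0) hi = -hi - 2;   // number of the last term <= upper
        BitSet bits = new BitSet(docToOrd.length);
        for (int doc = 0; doc < docToOrd.length; doc++) {
            int ord = docToOrd[doc];
            if (ord >= lo && ord <= hi) bits.set(doc);
        }
        return bits;
    }
}
```

The range bounds are resolved to term numbers once per query, so the per-document work is two integer comparisons against the array. A set match would follow the same pattern with a BitSet of accepted term numbers in place of the two bounds.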