From: Doug Cutting
To: "'lucene-dev@jakarta.apache.org'"
Subject: RE: Token retrieval question
Date: Thu, 11 Oct 2001 13:17:43 -0700

> From: Dmitry Serebrennikov [mailto:dmitrys@earthlink.net]
>
> Doug, thanks for posting these. I may end up going in this direction in
> the next few days and will use this as a blueprint. Maybe I'll end up
> putting in the first pass implementation and then you can later further
> tune it when you get to it.

Great!

One implementation tip: when merging terms from segments, build an array of ints for each segment, indexed by term number. These map from old segment term numbers to new term numbers in the merged index. Then merging vectors is really easy: just re-number them using the array for their segment. Vectors can be merged in a single pass through the vector file for each segment, writing the new vector file in a single pass.

> Question on term numbers through: what would be an approach for merging
> these across multiple IndexReaders for the purposes of MultiSearcher?
As you imply, it is possible to seek a SegmentTermEnum to a term number, but not a SegmentsTermEnum. This could be fixed in a number of ways.

The simplest and fastest would be to declare that term numbers are unavailable for unoptimized indexes and throw an exception.

A slower, kinder approach would be to iterate through all of the terms the first time this method is called. One could either save all of the terms in an array, which would be fastest but use a lot of memory, or one could save every, say, 128th term in an array. Then, to find the nth term, do a binary search of this array for the term before it. Then you can seek all of the sub-enums to that term and merge them up to the desired term, counting as you go. That's probably the best compromise: it's probably fast enough, and it doesn't use too much memory.

Note that, for good performance, clustering algorithms etc. should operate only on document and term numbers. These integers should only be mapped to Term and Document objects when they are displayed to the user. Thus the performance requirements for that mapping are not extreme. Lucene uses a similar strategy to keep search fast: internally, documents are referred to by number, and only when a Hit is displayed is it converted to a Document object.

Doug
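
The "save every 128th term" compromise above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Lucene code: each segment is modelled as a sorted String array instead of a real term enum, and the class name SampledTermLookup and its methods are hypothetical. Because the samples are evenly spaced, this sketch picks the starting sample by integer division and uses binary search only to seek each segment to that sample.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: maps a merged term number to its term using a sparse
// sample of terms. Segments are plain sorted arrays, not Lucene enums.
final class SampledTermLookup {
    private final List<String[]> segments; // each sorted, no duplicates inside a segment
    private final int interval;            // keep every interval-th merged term
    private final List<String> samples = new ArrayList<>();
    private int termCount = 0;

    SampledTermLookup(List<String[]> segments, int interval) {
        this.segments = segments;
        this.interval = interval;
        // One full pass over the merged terms, done once, saving sparse samples.
        int[] pos = new int[segments.size()];
        String term;
        while ((term = next(pos)) != null) {
            if (termCount % interval == 0) {
                samples.add(term);
            }
            termCount++;
        }
    }

    // Returns the smallest term at the current positions and advances every
    // segment that holds it (so shared terms count once); null when exhausted.
    private String next(int[] pos) {
        String min = null;
        for (int i = 0; i < pos.length; i++) {
            String[] seg = segments.get(i);
            if (pos[i] < seg.length && (min == null || seg[pos[i]].compareTo(min) < 0)) {
                min = seg[pos[i]];
            }
        }
        if (min == null) {
            return null;
        }
        for (int i = 0; i < pos.length; i++) {
            String[] seg = segments.get(i);
            if (pos[i] < seg.length && seg[pos[i]].equals(min)) {
                pos[i]++;
            }
        }
        return min;
    }

    // Finds the n-th merged term: start at the nearest sample at or before n,
    // seek every segment there, then merge forward, counting as we go.
    String termAt(int n) {
        if (n < 0 || n >= termCount) {
            throw new IndexOutOfBoundsException("term number " + n);
        }
        String start = samples.get(n / interval);
        int[] pos = new int[segments.size()];
        for (int i = 0; i < pos.length; i++) {
            int p = Arrays.binarySearch(segments.get(i), start);
            pos[i] = p >= 0 ? p : -p - 1; // first term >= start
        }
        String term = null;
        for (int steps = n % interval; steps >= 0; steps--) {
            term = next(pos);
        }
        return term;
    }

    int size() {
        return termCount;
    }
}
```

With an interval of 128, a lookup merges at most 128 terms after the seek, and the sample array holds roughly 1/128th of the terms, which is the memory/speed trade-off described above.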