Mailing-List: contact lucene-dev-help@jakarta.apache.org; run by ezmlm
Message-ID: <4BC270C6AB8AD411AD0B00B0D0493DF0EE7C6B@mail.grandcentral.com>
From: Doug Cutting <DCutting@grandcentral.com>
To: "'lucene-dev@jakarta.apache.org'" <lucene-dev@jakarta.apache.org>
Subject: RE: Token retrieval question
Date: Thu, 11 Oct 2001 13:17:43 -0700
MIME-Version: 1.0
Content-Type: text/plain;
	charset="iso-8859-1"

> From: Dmitry Serebrennikov [mailto:dmitrys@earthlink.net]
> 
> Doug, thanks for posting these. I may end up going in this direction in
> the next few days and will use this as a blueprint. Maybe I'll end up
> putting in the first pass implementation and then you can later further
> tune it when you get to it.

Great!  One implementation tip: when merging terms from segments, build an
array of ints for each segment, indexed by term number.  These map from old
segment term numbers to new term numbers in the merged index.  Then merging
vectors is really easy: just re-number them using the array for their
segment.  Vectors can be merged in a single pass through the vector file for
each segment, writing the new vector file in a single pass.
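
A minimal sketch of that renumbering, assuming each segment's term dictionary
is available as a sorted String array and a vector is just an int[] of term
numbers.  The class and method names (TermNumberRemap, buildMaps, remap) are
made up for illustration and are not Lucene APIs:

```java
import java.util.Arrays;
import java.util.TreeSet;

class TermNumberRemap {

    /** For each segment, builds an int array indexed by old term number
     *  whose value is the term's number in the merged dictionary. */
    static int[][] buildMaps(String[][] segmentTerms) {
        // Merged dictionary: the sorted union of every segment's terms.
        TreeSet<String> union = new TreeSet<>();
        for (String[] terms : segmentTerms) {
            union.addAll(Arrays.asList(terms));
        }
        String[] merged = union.toArray(new String[0]);

        int[][] maps = new int[segmentTerms.length][];
        for (int s = 0; s < segmentTerms.length; s++) {
            maps[s] = new int[segmentTerms[s].length];
            for (int old = 0; old < segmentTerms[s].length; old++) {
                // Every segment term is in the union, so the search always hits.
                maps[s][old] = Arrays.binarySearch(merged, segmentTerms[s][old]);
            }
        }
        return maps;
    }

    /** Renumbers one vector with the map of the segment it came from. */
    static int[] remap(int[] vector, int[] map) {
        int[] out = new int[vector.length];
        for (int i = 0; i < vector.length; i++) {
            out[i] = map[vector[i]];
        }
        return out;
    }
}
```

Because both the old and the merged dictionaries are sorted, each map is
increasing, so a vector that was sorted by term number stays sorted after
remapping.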

> Question on term numbers though: what would be an approach for merging
> these across multiple IndexReaders for the purposes of MultiSearcher?

As you imply, it is possible to seek a SegmentTermEnum to a term number, but
not a SegmentsTermEnum.  This could be fixed in a number of ways.  The
simplest and fastest would be to declare that term numbers are unavailable
for unoptimized indexes and throw an exception.  A slower, kinder approach
would be to iterate through all of the terms the first time this method is
called.  One could either save all of the terms in an array, which would be
fastest but use a lot of memory, or save every, say, 128th term in an array.
Then, to find the nth term, do a binary search of this array for the term
before it, seek all of the sub-enums to that term, and merge them up to the
desired term, counting as you go.  That's probably the best compromise: it
should be fast enough, and it doesn't use too much memory.
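
A rough sketch of the sampled approach, with each sub-enum modeled as a
sorted String array rather than a real SegmentTermEnum.  SampledTermIndex,
termAt and next are hypothetical names, not Lucene classes or methods.  Since
the saved array is positional, this sketch jumps straight to entry
n / interval instead of searching it; a binary search of the same array
would serve the reverse lookup, from a term to its number.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SampledTermIndex {
    private final String[][] segments; // each segment's terms, sorted
    private final int interval;        // e.g. 128
    private final String[] samples;    // samples[k] is merged term number k * interval

    SampledTermIndex(String[][] segments, int interval) {
        this.segments = segments;
        this.interval = interval;
        // One full merge of all segments, saving every interval-th term.
        List<String> saved = new ArrayList<>();
        int[] pos = new int[segments.length];
        int number = 0;
        for (String t = next(pos); t != null; t = next(pos)) {
            if (number % interval == 0) {
                saved.add(t);
            }
            number++;
        }
        this.samples = saved.toArray(new String[0]);
    }

    /** Returns the smallest term under the cursors and advances every
     *  cursor positioned on it; returns null when all are exhausted. */
    private String next(int[] pos) {
        String min = null;
        for (int s = 0; s < segments.length; s++) {
            if (pos[s] < segments[s].length) {
                String t = segments[s][pos[s]];
                if (min == null || t.compareTo(min) < 0) {
                    min = t;
                }
            }
        }
        if (min == null) {
            return null;
        }
        for (int s = 0; s < segments.length; s++) {
            if (pos[s] < segments[s].length && segments[s][pos[s]].equals(min)) {
                pos[s]++;
            }
        }
        return min;
    }

    /** Finds the nth merged term: seek to the nearest saved term at or
     *  before n, then merge forward, counting as we go. */
    String termAt(int n) {
        int k = n / interval;
        if (n < 0 || k >= samples.length) {
            throw new IndexOutOfBoundsException("term number " + n);
        }
        int[] pos = new int[segments.length];
        for (int s = 0; s < segments.length; s++) {
            // Seek: first term in this segment that is >= the saved term.
            int i = Arrays.binarySearch(segments[s], samples[k]);
            pos[s] = i >= 0 ? i : -i - 1;
        }
        String t = null;
        for (int count = k * interval; count <= n; count++) {
            t = next(pos);
            if (t == null) {
                throw new IndexOutOfBoundsException("term number " + n);
            }
        }
        return t;
    }
}
```

With an interval of 128, a lookup merges at most 128 terms after the seek,
and the saved array holds roughly 1/128th of the terms.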

Note that, for good performance, clustering algorithms etc. should operate
only on document and term numbers.  These integers should only be mapped to
Term and Document objects when they are displayed to the user.  Thus the
performance requirements for that mapping are not extreme.  Lucene uses a
similar strategy to keep search fast: internally, documents are referred to
by number, and only when a Hit is displayed is it converted to a Document
object.
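
To illustrate the integers-only idea, here is a small sketch in which the
inner loop of a hypothetical clustering step compares two documents purely by
term number, and strings are looked up only for display.  IntOnlyOverlap,
overlap and describe are invented names, and the String array stands in for
whatever number-to-term mapping is available:

```java
class IntOnlyOverlap {

    /** Counts the term numbers shared by two sorted term vectors.
     *  No Term or Document objects are touched here. */
    static int overlap(int[] a, int[] b) {
        int i = 0, j = 0, shared = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) {
                i++;
            } else if (a[i] > b[j]) {
                j++;
            } else {
                shared++;
                i++;
                j++;
            }
        }
        return shared;
    }

    /** Display step: resolve numbers to text only for what the user sees. */
    static String describe(int[] vector, String[] termsByNumber) {
        StringBuilder sb = new StringBuilder();
        for (int n : vector) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(termsByNumber[n]);
        }
        return sb.toString();
    }
}
```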

Doug