lucene-dev mailing list archives

From Dmitry Serebrennikov
Subject Re: Token retrieval question
Date Thu, 11 Oct 2001 21:30:32 GMT

Doug Cutting wrote:

>>From: Dmitry Serebrennikov
>>Doug, thanks for posting these. I may end up going in this direction in
>>the next few days and will use this as a blueprint. Maybe I'll end up
>>putting in the first pass implementation and then you can later further
>>tune it when you get to it.
>Great!  One implementation tip: when merging terms from segments, build an
>array of ints for each segment, indexed by term number.  These map from old
>segment term numbers to new term numbers in the merged index.  Then merging
>vectors is really easy: just re-number them using the array for their
>segment.  Vectors can be merged in a single pass through the vector file for
>each segment, writing the new vector file in a single pass.
Ok, got it.
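
A minimal sketch of that tip, under some simplifying assumptions: each segment's terms are held as a sorted String array, and a vector is just an array of term numbers. TermRenumbering and its methods are hypothetical names for illustration, not existing Lucene classes, and a real merge would stream the terms rather than collect them in a TreeSet.

```java
import java.util.Arrays;
import java.util.TreeSet;

// Hypothetical helper (not Lucene API): per-segment int[] maps from old to new term numbers.
final class TermRenumbering {

    // Merged term list: the sorted union of every segment's terms.
    static String[] mergeTerms(String[][] segmentTerms) {
        TreeSet<String> merged = new TreeSet<>();
        for (String[] terms : segmentTerms) {
            merged.addAll(Arrays.asList(terms));
        }
        return merged.toArray(new String[0]);
    }

    // maps[s][oldNumber] is the number of that term in the merged index.
    static int[][] buildMaps(String[][] segmentTerms, String[] mergedTerms) {
        int[][] maps = new int[segmentTerms.length][];
        for (int s = 0; s < segmentTerms.length; s++) {
            maps[s] = new int[segmentTerms[s].length];
            for (int old = 0; old < segmentTerms[s].length; old++) {
                maps[s][old] = Arrays.binarySearch(mergedTerms, segmentTerms[s][old]);
            }
        }
        return maps;
    }

    // Re-number one vector in a single pass using the map for its segment.
    static int[] renumber(int[] vector, int[] map) {
        int[] out = new int[vector.length];
        for (int i = 0; i < vector.length; i++) {
            out[i] = map[vector[i]];
        }
        return out;
    }
}
```

Each vector is touched once and only array lookups are involved, which is what makes the single pass per segment possible.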

>>Question on term numbers though: what would be an approach for merging
>>these across multiple IndexReaders for the purposes of MultiSearcher?
>As you imply, it is possible to seek a SegmentTermEnum to a term number, but
>not a SegmentsTermEnum.  
Did I imply that? :)
I was just thinking about numbering, but the tip above suggests that the 
terms will be fully renumbered when looking at them from the 
MultiSearcher. I think that is ok. Documents are assigned ranges 
instead, and we could do this for Terms since the term numbers probably 
do not need to be ordered the same way as the terms.
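
A rough sketch of the range idea, assuming each IndexReader can report how many terms it holds. TermNumberRanges and its methods are hypothetical names, not Lucene API; the layout simply mirrors the way document numbers are given a range per reader.

```java
// Hypothetical helper (not Lucene API): gives each reader its own range of term numbers.
final class TermNumberRanges {
    // starts[i] is the first global term number of reader i; the last slot holds the total.
    private final int[] starts;

    TermNumberRanges(int[] termCounts) {
        starts = new int[termCounts.length + 1];
        for (int i = 0; i < termCounts.length; i++) {
            starts[i + 1] = starts[i] + termCounts[i];
        }
    }

    // Global number of a term, given its reader and its number inside that reader.
    int toGlobal(int reader, int local) {
        return starts[reader] + local;
    }

    // Reader that owns a global term number.
    int readerOf(int global) {
        if (global < 0 || global >= starts[starts.length - 1]) {
            throw new IndexOutOfBoundsException("term number " + global);
        }
        int lo = 0;
        int hi = starts.length - 1; // invariant: starts[lo] <= global < starts[hi]
        while (hi - lo > 1) {
            int mid = (lo + hi) >>> 1;
            if (starts[mid] <= global) {
                lo = mid;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    // Number of the term inside its own reader.
    int toLocal(int global) {
        return global - starts[readerOf(global)];
    }
}
```

With this scheme a term that occurs in two readers gets two different numbers, one in each range, so anything that needs a single number per term would still have to reconcile them.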

>This could be fixed in a number of ways.  The
>simplest and fastest would be to declare that term numbers are unavailable
>for unoptimized indexes and throw an exception.  A slower, kinder approach
>would be to, the first time this method is called, iterate through all of
>the terms.  One could either save all of the terms in an array, which would
>be fastest, but use a lot of memory, or one could save every, say, 128th
>term in an array.  Then, to find the nth term, do a binary search of this
>array for the term before it.  Then you can seek all of the sub-enums to
>that term and then merge them up to the desired term, counting as you go.
>That's probably the best compromise: it's probably fast enough, and it
>doesn't use too much memory.
>Note that, for good performance, clustering algorithms etc. should operate
>only on document and term numbers.  These integers should only be mapped to
>Term and Document objects when they are displayed to the user.  Thus the
>performance requirements for that mapping are not extreme.  Lucene uses a
>similar strategy to keep search fast: internally documents are referred to
>by number: only when a Hit is displayed is it converted to a Document.
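
A sketch of the "save every Nth term" compromise, with sorted String arrays standing in for the per-segment term enums (each array is assumed to hold unique terms). SampledTermIndex is a hypothetical name, not Lucene API. Because the saved terms are evenly spaced, this sketch picks the starting sample by direct indexing (n / interval); a term-to-number lookup would binary-search the same array instead.

```java
import java.util.ArrayList;
import java.util.Arrays;

// Hypothetical helper (not Lucene API): maps a term number to a term across several segments.
final class SampledTermIndex {
    private final String[][] segments; // each sorted; stand-in for a per-segment term enum
    private final int interval;        // e.g. 128
    private final String[] samples;    // every interval-th term of the merged sequence
    private final int size;            // number of distinct terms overall

    SampledTermIndex(String[][] segments, int interval) {
        this.segments = segments;
        this.interval = interval;
        ArrayList<String> saved = new ArrayList<>();
        int[] pos = new int[segments.length];
        int n = 0;
        // One full pass over the merged terms, keeping every interval-th one.
        for (String t = next(pos); t != null; t = next(pos)) {
            if (n % interval == 0) {
                saved.add(t);
            }
            n++;
        }
        samples = saved.toArray(new String[0]);
        size = n;
    }

    // Returns the smallest term under the cursors and advances every cursor on it; null at the end.
    private String next(int[] pos) {
        String min = null;
        for (int s = 0; s < segments.length; s++) {
            if (pos[s] < segments[s].length) {
                String t = segments[s][pos[s]];
                if (min == null || t.compareTo(min) < 0) {
                    min = t;
                }
            }
        }
        if (min == null) {
            return null;
        }
        for (int s = 0; s < segments.length; s++) {
            if (pos[s] < segments[s].length && segments[s][pos[s]].equals(min)) {
                pos[s]++;
            }
        }
        return min;
    }

    // Finds the nth term: seek every segment to the nearest saved term, then merge forward.
    String term(int n) {
        if (n < 0 || n >= size) {
            throw new IndexOutOfBoundsException("term number " + n);
        }
        String start = samples[n / interval];
        int[] pos = new int[segments.length];
        for (int s = 0; s < segments.length; s++) {
            int i = Arrays.binarySearch(segments[s], start);
            pos[s] = i >= 0 ? i : -i - 1; // first term >= start in this segment
        }
        String t = next(pos);
        for (int k = n % interval; k > 0; k--) {
            t = next(pos);
        }
        return t;
    }
}
```

Each lookup merges at most interval - 1 terms past the saved one, and the memory cost is one saved term per interval.
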
Yes, I see that. One additional problem that I need to solve for my 
application is that I need to map from stemmed forms of the terms to at 
least one un-stemmed form. Ideally it would be all un-stemmed forms, but 
I can live with the first one. I realize that Lucene does not easily 
support this because of the separation of church and state (I mean the 
term filtering prior to indexing and querying), but I still need this 
functionality... So, the question is, is this going to be common enough 
to add a concept of a TermDictionary to Lucene and provide methods to 
access it on the IndexReader and IndexWriter? If not, I could implement 
this externally, but then I would not be able to use the IO framework 
and the whole concept of directories. Also, since the term numbers are going 
to be ephemeral just like doc numbers, externally I would have to refer 
to them by text, slowing down the translation process, etc.

It's not yet clear enough in my mind to put an API together. Maybe the 
way to do this is to create an Analyzer that outputs a subclass of Term 
that has additional data, namely: String original_text, and int data. 
The data int is to keep application-specific flags such as term 
classification. Then the indexing code can be extended to support these 
extra fields and maintain the TermDictionary with them. The first entry 
for a given term wins in terms of the original_text and the data int.
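
To make the "first entry wins" rule concrete, here is an in-memory sketch. TermDictionarySketch and Entry are hypothetical names; nothing like this exists in Lucene today, and the sketch leaves out the hard part, which is persisting the entries through the Directory framework.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical in-memory sketch of the proposed TermDictionary (not Lucene API).
final class TermDictionarySketch {

    // Extra data kept per stemmed term.
    static final class Entry {
        final String originalText; // first un-stemmed form seen for the term
        final int data;            // application-specific flags, e.g. term classification

        Entry(String originalText, int data) {
            this.originalText = originalText;
            this.data = data;
        }
    }

    private final Map<String, Entry> entries = new LinkedHashMap<>();

    // Records the entry only if the stemmed term has none yet: the first entry wins.
    void add(String stemmed, String originalText, int data) {
        entries.putIfAbsent(stemmed, new Entry(originalText, data));
    }

    // Returns the entry for a stemmed term, or null if the term was never added.
    Entry lookup(String stemmed) {
        return entries.get(stemmed);
    }
}
```

For example, adding ("run", "running", 1) and then ("run", "runs", 2) leaves "running" and 1 as the entry for "run".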

Any ideas to make this less of a hack?


