From Marvin Humphrey
Subject Re: Flexible Indexing (was Re: Lucene Planning)
Date Fri, 02 Jun 2006 15:50:10 GMT

On Jun 2, 2006, at 6:48 AM, Grant Ingersoll wrote:
> I thought it was you, but wasn't sure.

I'm always looking for ways to minimize Term Vectors, because I  
consider excerpting/highlighting a core feature rather than an add- 
on, and they seem like such overkill.  It bothers me that they  
duplicate so much information.

I've been toying with the idea of a hitCollector.collect(int docNum,  
float score, ScorePositions[] scorePositions) method -- or, more  
likely, a hitCollector.collect(Scorer scorer) method -- that would  
preserve each position that contributed to the score of a document  
and how much it contributed, allowing that information to be passed  
through a Hit object to the Highlighter.
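
A rough sketch of the first variant, to make the shape concrete. Everything here is hypothetical: ScorePosition, this Hit class, PositionHitCollector, and bestPosition() are illustrative names rather than existing Lucene API, and a real Scorer would have to be changed to supply the per-position contributions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical: one position that contributed to a document's score.
final class ScorePosition {
    final int position;        // token position within the field
    final float contribution;  // share of the score attributed to this position

    ScorePosition(int position, float contribution) {
        this.position = position;
        this.contribution = contribution;
    }
}

// Hypothetical Hit that carries the scoring positions along with the score.
final class Hit {
    final int docNum;
    final float score;
    final ScorePosition[] scorePositions;

    Hit(int docNum, float score, ScorePosition[] scorePositions) {
        this.docNum = docNum;
        this.score = score;
        this.scorePositions = scorePositions;
    }
}

// Hypothetical collector: keeps the positions instead of discarding them.
class PositionHitCollector {
    private final List<Hit> hits = new ArrayList<>();

    public void collect(int docNum, float score, ScorePosition[] scorePositions) {
        hits.add(new Hit(docNum, score, scorePositions));
    }

    public List<Hit> hits() {
        return hits;
    }

    // Position with the largest contribution, or -1 if there are none.
    // A highlighter could centre its excerpt around this position.
    static int bestPosition(Hit hit) {
        int best = -1;
        float bestContribution = Float.NEGATIVE_INFINITY;
        for (ScorePosition sp : hit.scorePositions) {
            if (sp.contribution > bestContribution) {
                bestContribution = sp.contribution;
                best = sp.position;
            }
        }
        return best;
    }
}
```

With something like this, the Highlighter would only need to map the chosen positions back to character offsets.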

That might be complemented by storing the startOffsets and endOffsets  
for each field as streams of delta-encoded VInts along with the  
stored field data.  Conceptually, it would be even cleaner to keep  
startOffsets and endOffsets in the postings...

a. <doc>+

b. <doc, boost>+

c. <doc, freq, <position>+ >+

d. <doc, freq, <position, boost>+ >+

e. <doc, freq, <position, boost, startOffset, endOffset>+ >+

... and pass *everything* the Highlighter needs to the Hit object.   
However, the offsets are never needed for scoring.
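
For the stored-field variant, here is a minimal sketch of how a field's offsets could be written as a stream of delta-encoded VInts. The OffsetStream class and its layout (start delta, then token length, per token) are assumptions made for illustration, not an existing file format. It also assumes start offsets never decrease and that each endOffset is at least its startOffset.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper: packs per-token offsets into delta-encoded VInts.
class OffsetStream {

    // VInt: 7 data bits per byte, low-order bits first, high bit = "more follows".
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // pos is a one-element cursor into data, advanced as bytes are consumed.
    static int readVInt(byte[] data, int[] pos) {
        byte b = data[pos[0]++];
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = data[pos[0]++];
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    static byte[] encode(int[] startOffsets, int[] endOffsets) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prevStart = 0;
        for (int i = 0; i < startOffsets.length; i++) {
            writeVInt(out, startOffsets[i] - prevStart);     // delta from previous start
            writeVInt(out, endOffsets[i] - startOffsets[i]); // token length
            prevStart = startOffsets[i];
        }
        return out.toByteArray();
    }

    // Returns { startOffsets, endOffsets } for tokenCount tokens.
    static int[][] decode(byte[] data, int tokenCount) {
        int[] starts = new int[tokenCount];
        int[] ends = new int[tokenCount];
        int[] pos = {0};
        int prevStart = 0;
        for (int i = 0; i < tokenCount; i++) {
            starts[i] = prevStart + readVInt(data, pos);
            ends[i] = starts[i] + readVInt(data, pos);
            prevStart = starts[i];
        }
        return new int[][] {starts, ends};
    }
}
```

Because both the deltas and the token lengths are usually small, most values fit in a single byte.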

> I would also like a way to store the frequency of the term in the  
> overall collection (probably should go in the Term dictionary, but  
> not sure, at the cost of an additional VInt per term, but I am open  
> to other places to store it).  Right now, in order to calculate  
> this, one has to either store it separately at indexing time (using  
> a term counting Filter) or calculate it at runtime by looping over  
> the TermDocs and summing.

Sure, makes sense to me.  Sounds like a custom codec you'd define.   
(The following code has been swiped and adapted from TermBuffer...)

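public class CollFreqCodec extends TermDictionaryCodec {
   private int collFreq;    // frequency of the term across the collection

   // term, bytes, and field are assumed to be fields of the (hypothetical)
   // TermDictionaryCodec base class.
   public void readRecord (IndexInput input, FieldInfos fieldInfos)
     throws IOException {
     this.term = null;                           // invalidate cache
     int start = input.readVInt();               // length of shared prefix
     int length = input.readVInt();              // length of new suffix
     int totalLength = start + length;           // full length of the term
     input.readBytes(this.bytes, start, length);
     this.field = fieldInfos.fieldName(input.readVInt());
     this.collFreq = input.readVInt();           // the extra VInt per term
   }
}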

That's not quite right, because I'm envisioning a codec rather than a  
TermBuffer subclass, but maybe you get the idea.
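
To show the record round-tripping without any Lucene classes, here is a self-contained sketch. CollFreqRecord, its fields, and the writeRecord/readRecord helpers are hypothetical, and java.io streams stand in for IndexInput/IndexOutput. The layout assumed is: prefix length, suffix length, suffix bytes, field number, collFreq, all but the suffix bytes as VInts.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical stand-in for a term dictionary record with a trailing collFreq.
class CollFreqRecord {
    byte[] bytes = new byte[16]; // term bytes, reused across records
    int length;                  // number of valid bytes in the buffer
    int fieldNum;
    int collFreq;

    static void writeVInt(DataOutputStream out, int v) throws IOException {
        while ((v & ~0x7F) != 0) {
            out.writeByte((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        out.writeByte(v);
    }

    static int readVInt(DataInputStream in) throws IOException {
        byte b = in.readByte();
        int v = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.readByte();
            v |= (b & 0x7F) << shift;
        }
        return v;
    }

    // Writes term as a delta against prev: only the non-shared suffix is stored.
    static void writeRecord(DataOutputStream out, byte[] prev, byte[] term,
                            int fieldNum, int collFreq) throws IOException {
        int start = 0;
        int limit = Math.min(prev.length, term.length);
        while (start < limit && prev[start] == term[start]) {
            start++;
        }
        writeVInt(out, start);
        writeVInt(out, term.length - start);
        out.write(term, start, term.length - start);
        writeVInt(out, fieldNum);
        writeVInt(out, collFreq);
    }

    // Reads the next record; the shared prefix is already in the buffer.
    void readRecord(DataInputStream in) throws IOException {
        int start = readVInt(in);
        int suffixLength = readVInt(in);
        int totalLength = start + suffixLength;
        if (bytes.length < totalLength) {
            bytes = Arrays.copyOf(bytes, totalLength);
        }
        in.readFully(bytes, start, suffixLength);
        length = totalLength;
        fieldNum = readVInt(in);
        collFreq = readVInt(in);
    }

    String termText() {
        return new String(bytes, 0, length, StandardCharsets.UTF_8);
    }
}
```

The reader keeps the previous term's bytes in its buffer, so records have to be read in the order they were written.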

Marvin Humphrey
Rectangular Research

