lucene-dev mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: Rich positions (was "boosting fields")
Date Thu, 27 Apr 2006 19:17:52 GMT
Marvin Humphrey wrote:
> Moving away from cached norms was the second of three major changes to
> the file format on my agenda, and the one I was all but certain I
> wouldn't be able to sell to the Lucene community.  The first was using
> bytecounts at the head of Strings.
>
> The third was storing start offsets and end offsets in the ProxFile.
> It rankles that much of the information from tis/frq/prx gets
> duplicated in the term vector files, but highlighting is most efficient
> when you know the offsets, and the primary index stops short of storing
> that information.  Currently, we have this:
> 
>     ProxFile (.prx) -->  <TermPositions>TermCount
> 
> How about this?
> 
>     ProxFile (.prx) -->  <TermPositions,TermOffsets>TermCount

This would at least double the size of the .prx file, the largest file 
in Lucene's index.  Yes, it's useful, but not all folks will use it.  So
not all folks should have to pay for it.  One way is to try to make it 
arbitrarily extensible, but to some degree, that's going to end up being 
language-specific.
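
A rough way to see the size concern is to count bytes per posting.  The
sketch below assumes positions are stored as VInt deltas and that offsets
would be added as two more VInts per position (a start-offset delta and a
token length).  That offset layout, the class name and the sample numbers
are all hypothetical, chosen only for illustration.

```java
// Rough size estimate. The offset layout below is an assumption made for
// illustration, not an actual or proposed Lucene file format.
class PrxSizeSketch {
    // Number of bytes a Lucene-style VInt needs for a non-negative value
    // (7 data bits per byte).
    static int vIntLength(int value) {
        int n = 1;
        while ((value & ~0x7F) != 0) {
            value >>>= 7;
            n++;
        }
        return n;
    }

    // Bytes for positions alone, each stored as a delta from the previous one.
    static int positionsOnly(int[] positions) {
        int bytes = 0;
        int prev = 0;
        for (int p : positions) {
            bytes += vIntLength(p - prev);
            prev = p;
        }
        return bytes;
    }

    // Bytes when every position also carries a start-offset delta and a
    // token length (hypothetical layout).
    static int positionsWithOffsets(int[] positions, int[] starts, int[] ends) {
        int bytes = positionsOnly(positions);
        int prevStart = 0;
        for (int i = 0; i < starts.length; i++) {
            bytes += vIntLength(starts[i] - prevStart); // start-offset delta
            bytes += vIntLength(ends[i] - starts[i]);   // token length
            prevStart = starts[i];
        }
        return bytes;
    }

    public static void main(String[] args) {
        // Made-up postings for one term in one document.
        int[] positions = {3, 17, 42};
        int[] starts = {15, 96, 250};
        int[] ends = {21, 102, 256};
        System.out.println("positions only:  " + positionsOnly(positions) + " bytes");
        System.out.println("with offsets:    "
                + positionsWithOffsets(positions, starts, ends) + " bytes");
    }
}
```

Under these assumptions every position costs at least two extra bytes, and
character offsets outgrow the one-byte VInt range sooner than token
positions do, so the overhead only goes up from there.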

So perhaps instead we should simply allocate more bits in the FieldInfo.
We could allocate bits for WEIGHT_PER_POSITION, OFFSETS_IN_PRX,
NORMS_IN_FRQ, OMIT_PRX, OMIT_FREQ, etc.  We can increase the number of
bits there by turning this into a VInt, which would be back-compatible, no?
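
A minimal sketch of why the VInt change could be back-compatible: a VInt
encodes any value below 128 as a single byte identical to the raw byte, so
an existing flags byte would read back unchanged as long as its high bit
was never set.  The flag names are the ones listed above, but the bit
values, the class name and the helper methods are assumptions made for
illustration, not Lucene's actual FieldInfo layout.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch: flag names come from the message, bit values are assumed.
class FieldFlagsSketch {
    // Assumed bit assignments, for illustration only.
    static final int WEIGHT_PER_POSITION = 1 << 5;
    static final int OFFSETS_IN_PRX      = 1 << 6;
    static final int NORMS_IN_FRQ        = 1 << 7;
    static final int OMIT_PRX            = 1 << 8;
    static final int OMIT_FREQ           = 1 << 9;

    // Lucene-style VInt: 7 data bits per byte, low-order group first,
    // high bit set on every byte except the last.
    static byte[] writeVInt(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    // Decode a single VInt from the start of the given bytes.
    static int readVInt(byte[] bytes) {
        int value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break; // last byte of this VInt
            }
            shift += 7;
        }
        return value;
    }

    public static void main(String[] args) {
        int oldFlags = 0x13; // stands in for any existing flags byte below 0x80
        byte[] oldBytes = writeVInt(oldFlags);
        System.out.println("old flags: " + oldBytes.length + " byte(s)");

        int newFlags = oldFlags | OFFSETS_IN_PRX | OMIT_FREQ;
        byte[] newBytes = writeVInt(newFlags);
        System.out.println("new flags: " + newBytes.length + " byte(s), round-trips to "
                + (readVInt(newBytes) == newFlags));
    }
}
```

The caveat this makes visible is that a reader expecting a VInt only
agrees with an old single-byte value when that byte is below 0x80, so
compatibility depends on the high bit of the existing flags byte being
unused in indexes already written.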

Doug
