lucene-dev mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: possible segment merge improvement?
Date Thu, 01 Nov 2007 16:46:07 GMT

On Nov 1, 2007, at 3:04 AM, Michael McCandless wrote:

> In KinoSearch, merging of stored fields & term vectors is always a
> fast concatenation of the entry for that document, whereas Lucene must
> re-interpret/re-number all fields on the doc, in general.  In fact I
> think that KinoSearch stores field names directly in the index (ie,
> not numbers).

Yes, that's right.  <http://xrl.us/73dx> (Link to mail-archives.apache.org)

Ferret and KS had both previously implemented Robert's suggested mod,  
where no remaps take place if field numbers can be matched up.  KS  
also expended extra effort to keep field numbers consistent (and I  
think Ferret did too) -- but the possibility that we would have to  
remap couldn't ever be eliminated.
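
A minimal Python sketch of that kind of check, for illustration only. It is not code from Ferret, KS, or Lucene; the function name and the dict-based field tables are hypothetical.

```python
def needs_remap(segment_field_numbers, merged_field_numbers):
    """Return True if any field in the source segment carries a different
    number than the same field in the merged segment.

    Both arguments are hypothetical {field_name: field_number} tables.
    """
    return any(
        merged_field_numbers.get(name) != number
        for name, number in segment_field_numbers.items()
    )


# Numbers line up: records could be copied without renumbering.
same = needs_remap({"title": 0, "body": 1}, {"title": 0, "body": 1, "url": 2})

# "body" and "title" are swapped: every record has to be rewritten.
differs = needs_remap({"body": 0, "title": 1}, {"title": 0, "body": 1})
```

The fast path only applies when the check comes back False for a segment, so the remapping code still has to exist for the other case.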

Going with field names rather than numbers allowed KS to eliminate a  
big chunk of code.  For the price of a small increase in index size,  
the segment merging process for stored fields and term vectors got  
much simpler.  No more parsing, no more remapping -- it became  
possible to read the record naively as one chunk and copy it, no  
matter what.
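
To make the difference concrete, here is a rough Python sketch of the two merge strategies. The record layouts and function names are invented for the example; they are not the actual KinoSearch or Lucene file formats.

```python
import struct


def encode_named_record(fields):
    """Hypothetical layout: field count, then length-prefixed (name, value) pairs."""
    out = [struct.pack(">H", len(fields))]
    for name, value in fields.items():
        n, v = name.encode("utf-8"), value.encode("utf-8")
        out.append(struct.pack(">H", len(n)) + n)
        out.append(struct.pack(">I", len(v)) + v)
    return b"".join(out)


def encode_numbered_record(fields, field_numbers):
    """Hypothetical layout: field count, then (field number, length-prefixed value)."""
    out = [struct.pack(">H", len(fields))]
    for name, value in fields.items():
        v = value.encode("utf-8")
        out.append(struct.pack(">HI", field_numbers[name], len(v)) + v)
    return b"".join(out)


def merge_named(segments):
    """Records carry their own field names, so each one is copied as an opaque chunk."""
    merged = []
    for records in segments:
        merged.extend(records)  # no parsing, no remapping
    return merged


def merge_numbered(segments, merged_numbers):
    """Each record is parsed and its field numbers rewritten for the merged segment.

    segments: list of (records, number_to_name) pairs, one per source segment.
    merged_numbers: {field_name: field_number} for the merged segment.
    """
    merged = []
    for records, number_to_name in segments:
        for record in records:
            (count,) = struct.unpack_from(">H", record, 0)
            pos = 2
            out = [record[:2]]
            for _ in range(count):
                number, length = struct.unpack_from(">HI", record, pos)
                pos += 6
                value = record[pos:pos + length]
                pos += length
                new_number = merged_numbers[number_to_name[number]]
                out.append(struct.pack(">HI", new_number, length) + value)
            merged.append(b"".join(out))
    return merged
```

The named records are larger because every record repeats its field names, which is the index-size cost mentioned above. In exchange, merge_named never looks inside a record.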

If Lucene were to go this route, my suggestion would be to start a  
new subclass of FieldsWriter that uses different index extensions.   
(KS uses .ds and .dsx: "Document Storage".)  Individual  
SegmentReaders can then decide which subclass to use based on which  
files are detected.
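
A sketch of what that detection could look like, in Python rather than Java for brevity. The class and function names are hypothetical, and it shows the reader side rather than FieldsWriter itself. The .fdt/.fdx pair stands for Lucene's existing stored-fields files and .ds/.dsx for the KS-style ones.

```python
class StoredFieldsReader:
    """Hypothetical base class; each subclass declares the files it needs."""
    extensions = ()

    def __init__(self, segment_name):
        self.segment_name = segment_name


class NumberedFieldsReader(StoredFieldsReader):
    # Existing format: records refer to fields by number.
    extensions = (".fdt", ".fdx")


class NamedFieldsReader(StoredFieldsReader):
    # New format: records carry field names directly.
    extensions = (".ds", ".dsx")


def open_stored_fields(segment_name, file_names):
    """Pick a reader subclass based on which files exist for the segment."""
    files = set(file_names)
    for cls in (NamedFieldsReader, NumberedFieldsReader):
        if all(segment_name + ext in files for ext in cls.extensions):
            return cls(segment_name)
    raise FileNotFoundError(
        f"no stored-fields files found for segment {segment_name!r}"
    )


# An index can hold old and new segments side by side.
listing = ["_1.fdt", "_1.fdx", "_2.ds", "_2.dsx"]
old_reader = open_stored_fields("_1", listing)
new_reader = open_stored_fields("_2", listing)
```

Because the choice is made per segment, existing segments stay readable and only newly written or merged segments need to use the new files.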

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/




