lucene-dev mailing list archives

From Marvin Humphrey
Subject Re: storing term text internally as byte array and bytecount as prefix, etc.
Date Tue, 02 May 2006 02:08:52 GMT
On May 1, 2006, at 6:27 PM, jian chen wrote:

> This way, for indexing new documents, the new Term(String text) is
> called and utf8bytes will be obtained from the input term text. For
> segment term info merge, the utf8bytes will be loaded from the Lucene
> index, which already stores the term text as utf8 bytes. Therefore,
> no conversion is needed.

SegmentMerger will have to change to use bytes if a bytecount-based
string header is going to achieve acceptable performance.  Doug
pointed that out when I was about to throw in the towel because I
couldn't get things fast enough.  Changing the implementation of Term
would have a very broad impact; I'd look for other ways to go about
it first.  But I'm not an expert on SegmentMerger, as KinoSearch
doesn't use the same technique for merging.
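
To make the idea concrete, here is a rough sketch of what "use bytes"
could look like at merge time: term text is encoded to UTF-8 once, stored
behind a byte count, and then copied and compared as raw bytes with no
String conversion.  Everything here is hypothetical illustration: the
class ByteCountTerms and its methods are not Lucene or KinoSearch code,
and the fixed 4-byte length prefix is a simplification chosen to keep the
example short, not the actual file format.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Lucene: byte-count-prefixed term text.
final class ByteCountTerms {
    // Indexing time: encode once, then write the byte count and the bytes.
    // (A fixed-width int prefix is an assumption made for brevity.)
    static void writeTerm(ByteBuffer out, String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        out.putInt(utf8.length);
        out.put(utf8);
    }

    // Read the next term's raw bytes without decoding them to a String.
    static byte[] readTermBytes(ByteBuffer in) {
        byte[] utf8 = new byte[in.getInt()];
        in.get(utf8);
        return utf8;
    }

    // Merge time: the stored bytes pass straight through, no conversion.
    static void copyTerm(ByteBuffer in, ByteBuffer out) {
        byte[] utf8 = readTermBytes(in);
        out.putInt(utf8.length);
        out.put(utf8);
    }

    // Compare terms as unsigned bytes; for valid UTF-8 this yields
    // Unicode code point order.
    static int compareTermBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) return diff;
        }
        return a.length - b.length;
    }
}
```

The point of the sketch is only that a merge loop built on copyTerm and
compareTermBytes never has to decode the term text; whether that is
enough to make the merge fast is exactly the open question.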

My plan was to first submit a patch that made the change to the file  
format but didn't touch SegmentMerger, then attack SegmentMerger and  
also see if other developers could suggest optimizations.

However, I have an awful lot on my plate right now, and I basically
get paid to do KinoSearch-related work, but not Lucene-related work.
It's hard for me to break out the time to do the Java coding,
especially since I don't have that much experience with Java and I'm
slow.  I'm not sure how soon I'll be able to get back to those
bytecount patches.

Marvin Humphrey
Rectangular Research
