lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael Busch <>
Subject Re: possible TermInfosReader speedup
Date Thu, 09 Apr 2009 06:13:05 GMT
On 4/8/09 2:08 PM, Earwin Burrfoot wrote:
> On Thu, Apr 9, 2009 at 00:14, Michael McCandless
> <>  wrote:
>> On Wed, Apr 8, 2009 at 3:46 PM, Earwin Burrfoot<>  wrote:
>>> Currently, when we're seeking a given Term, it does a binary search
>>> across all term space, including terms belonging to other fields.
>>> I propose augmenting fields file with two pointers (firstTerm,
>>> lastTerm) for each field. That reduces range we need to search, and
>>> instead of comparing Terms we only need to compare values.
>>> How does that sound?
>> That sounds great!  Wanna make a patch?
> Can try. But I'm not at all comfortable with these parts of Lucene,
> will probably need help, at least with tests.

I was thinking about doing this as part of LUCENE-1195. However, I doubt 
that the net win will be very noticeable here. A common scenario is that 
you have an index with one big body field that has a lot of unique 
terms, plus several metafields that contribute less unique terms. Even 
if all metafields together would contribute the same amount of 
additional unique terms as the body field, this proposed change would 
only save one term comparison per body term lookup. The reason is the 
O(log(n)) of the in-memory binary search.

The story is a bit different for looking up terms on the smaller 
metafields. Here you could probably save more term comparisons. But I 
still think the improvement here would at the end be in the noise. I 
mean how long do e.g. 30 in-memory term comparisons take compared to all 
the disk seeks, sequential I/Os, VInts decodings, etc. that every search 
needs to do? And you probably never have more than 2^30 unique terms in 
your index.

So I doubt this improvement will be noticeable, but I would be happy if 
you proved me wrong and this was indeed a long hanging fruit.


View raw message