lucene-dev mailing list archives

From "Robert Engels"
Subject RE: caching term information?
Date Mon, 22 May 2006 22:21:39 GMT
Seems Doug is correct. I ran our tests through the profiler, and most of the
time is spent reading and parsing SegmentTermDocs (see the attached profiler
output, which is very interesting).

I was amazed at how much time is spent in both readVInt() and readByte().
It seems high, but I think it is mainly due to the number of invocations.

1) What if BufferedIndexInput had an optimized version of readVInt() that
used the buffer and manipulated the position directly?

2) Instead of caching the TermInfo, what if the TermDocs were cached (again,
for the top 20% of terms)? The memory requirement would be much greater, but
you could also say "do not cache TermDocs that have more than X
documents". The optimized searcher already converts TermQueries similar to
this to a Filter anyway.
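
To make point 1 concrete, here is a rough sketch of the idea. It is not
Lucene's BufferedIndexInput: the class BufferedVIntReader and all of its
methods are hypothetical, and it assumes the whole VInt is already in the
buffer (a real implementation would have to fall back to the refilling path
when fewer than 5 bytes remain). It only contrasts decoding through
readByte() with decoding straight from the array.

```java
import java.util.Arrays;

// Hypothetical sketch, not Lucene code. Assumes the entire VInt is
// already present in the in-memory buffer (no refill handling).
final class BufferedVIntReader {
    private final byte[] buffer;
    private int position;

    BufferedVIntReader(byte[] buffer) {
        this.buffer = buffer;
    }

    // Baseline: one method call per byte read.
    byte readByte() {
        return buffer[position++];
    }

    // Baseline decode: every byte goes through readByte().
    int readVIntViaReadByte() {
        byte b = readByte();
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = readByte();
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    // Proposed variant: index the array through a local position and
    // write the field back once, when the whole value is decoded.
    int readVIntDirect() {
        final byte[] buf = buffer;
        int pos = position;
        byte b = buf[pos++];
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = buf[pos++];
            value |= (b & 0x7F) << shift;
        }
        position = pos;
        return value;
    }

    int position() {
        return position;
    }

    // Helper for trying the sketch: encodes a non-negative int with
    // 7 data bits per byte, low-order bits first, high bit = "more".
    static byte[] encodeVInt(int value) {
        byte[] out = new byte[5];
        int n = 0;
        while ((value & ~0x7F) != 0) {
            out[n++] = (byte) ((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out[n++] = (byte) value;
        return Arrays.copyOf(out, n);
    }
}
```

Whether this actually beats the readByte() version would need to be
measured, since the JIT may already inline the per-byte call.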
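
For point 2, something along these lines is what I have in mind. Again this
is only a sketch: PostingsCache is a hypothetical name, the postings are
simplified to a plain int[] of document numbers rather than a real TermDocs,
and the two limits (maximum number of cached terms, maximum documents per
cached term) are made-up parameters. It uses java.util.LinkedHashMap in
access order for the LRU behaviour and is not thread-safe.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: an LRU cache of per-term document lists that
// refuses to cache terms matching more than maxDocsPerTerm documents.
final class PostingsCache {
    private final int maxDocsPerTerm;
    private final Map<String, int[]> cache;

    PostingsCache(final int maxTerms, int maxDocsPerTerm) {
        this.maxDocsPerTerm = maxDocsPerTerm;
        // accessOrder = true makes iteration order least-recently-used
        // first, so removeEldestEntry evicts the LRU term.
        this.cache = new LinkedHashMap<String, int[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, int[]> eldest) {
                return size() > maxTerms;
            }
        };
    }

    // Returns the cached document numbers, or null on a miss.
    int[] get(String term) {
        return cache.get(term);
    }

    // Caches the list unless it is over the size limit; returns
    // whether the entry was stored.
    boolean put(String term, int[] docIds) {
        if (docIds.length > maxDocsPerTerm) {
            return false;
        }
        cache.put(term, docIds);
        return true;
    }

    int size() {
        return cache.size();
    }
}
```

The per-term limit is what keeps the memory requirement bounded: the worst
case is roughly maxTerms * maxDocsPerTerm ints.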

-----Original Message-----
From: Doug Cutting
Sent: Monday, May 22, 2006 11:33 AM
Subject: Re: caching term information?

Marvin Humphrey wrote:
> On May 20, 2006, at 12:01 AM, Robert Engels wrote:
>> Maybe don't cache the term pages, then, just cache the frequently
>> requested terms themselves.
> That sounds like a winner.  Search term frequencies follow a power law
> distribution.  Cache the top 20% or so in an LRU and you'll cut down on
> disk seeks and linear scanning significantly.

Keep in mind that the .tis file is compressed: it uses far less memory
per term than a TermInfo does.  So, to minimize disk i/o, one should
leave things compressed and cache portions of the .tis file instead.
The OS's buffer cache should do this well for you.  But if the system
call overhead is causing significant delay, then the .tis file could be
memory mapped.  And if constructing and scanning TermInfos is the
primary delay, then, of course, a cache of TermInfos might be
indicated.  In summary, there are lots of possible places to optimize
here, but it's not clear which, if any, are warranted.

Folks have benchmarked a TermInfo cache before and not found it
advantageous.  But perhaps your uses are sufficiently different that this
is no longer the case.

