lucene-dev mailing list archives

From "Steven Tamm (JIRA)"
Subject [jira] Created: (LUCENE-506) Optimize Memory Use for Short-Lived Indexes (Do not load TermInfoIndex if you know the queries ahead of time)
Date Thu, 02 Mar 2006 00:10:40 GMT
Optimize Memory Use for Short-Lived Indexes (Do not load TermInfoIndex if you know the queries
ahead of time)

         Key: LUCENE-506
     Project: Lucene - Java
        Type: Improvement
  Components: Index  
    Versions: 2.0    
 Environment: Patch against Lucene 1.9 trunk as of Mar 1 06
    Reporter: Steven Tamm

Summary: Provide a way to avoid loading the TermInfoIndex into memory if you know all the
terms you are ever going to query.

In our search environment, we have a large number of indexes (many thousands), any of which
may be queried by any number of hosts.  These indexes may be very large (~1M documents), but
since we have a low term/doc ratio, they contain 7-11M terms.  With an index interval of 128, that
means ~70-90K terms in the TermInfoIndex.  On loading the index, Lucene instantiates a Term, a
TermInfo, a String, and a char[] for each of them.  When the index is long lived, this makes some
sense because you can quickly search the list of terms using binary search.  However, since we
throw away the indexes very often, a lot of garbage is created per query.

Here's an example where we load a large index 10 times.  The four classes below account for
about 70MB of allocations, which corresponds to 7MB of garbage per query.
          percent          live          alloc'ed  stack class
 rank   self  accum     bytes objs     bytes  objs trace name
    1  4.48%  4.48%   4678736 128946  23393680 644730 387749 char[]
    3  3.95% 12.61%   4126272 128946  20631360 644730 387751 org.apache.lucene.index.TermInfo
    6  2.96% 22.71%   3094704 128946  15473520 644730 387748 java.lang.String
    8  1.98% 26.97%   2063136 128946  10315680 644730 387750 org.apache.lucene.index.Term
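
The 7MB figure can be checked against the alloc'ed bytes column above.  The following is only a
rough sketch of that arithmetic; the class and method names are made up for illustration and
are not part of the patch.

```java
// Hypothetical helper: reproduces the per-load garbage estimate from the profile.
class GarbageEstimate {

    /** Total allocated bytes divided by the number of index loads. */
    static long bytesPerLoad(long[] allocatedBytes, int loads) {
        long total = 0;
        for (long b : allocatedBytes) {
            total += b;
        }
        return total / loads;
    }

    public static void main(String[] args) {
        // alloc'ed bytes for char[], TermInfo, String and Term, copied from the profile
        long[] allocated = {23393680L, 20631360L, 15473520L, 10315680L};
        System.out.println(bytesPerLoad(allocated, 10)); // prints 6981424
    }
}
```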

This adds up after a while.  Since we know exactly which Terms we're going to search for before
even opening the index, there's no need to allocate this much memory.  Upon opening the index,
we can go through the TII in sequential order and retrieve only the entries needed to locate
those terms in the main term dictionary, which reduces the storage requirements dramatically.
If you only make 1 query/index, this reduces the amount of garbage generated by querying by
about 60% and increases throughput by 77%.
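
To illustrate the idea of a single sequential pass over the TII, here is a minimal sketch.  It
is not the code from the patch: plain Strings stand in for Term/TermInfo index entries, and
the class and method names (PrefetchSketch, prefetch, keep) are assumptions made for this
example only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Iterator;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch: plain Strings stand in for Lucene's index entries.
class PrefetchSketch {

    /**
     * Reads the sorted index entries once, front to back, and keeps only the
     * entries needed to locate the wanted terms: for each wanted term, the
     * last index entry that sorts at or before it.
     */
    static List<String> prefetch(Iterator<String> sortedIndex, Collection<String> wanted) {
        List<String> targets = new ArrayList<>(new TreeSet<>(wanted)); // sorted, de-duplicated
        List<String> kept = new ArrayList<>();
        String prev = null; // most recent index entry read
        int t = 0;          // next target still waiting for its entry

        while (sortedIndex.hasNext() && t < targets.size()) {
            String entry = sortedIndex.next();
            // Targets sorting before this entry are covered by the previous one.
            while (t < targets.size() && targets.get(t).compareTo(entry) < 0) {
                keep(kept, prev);
                t++;
            }
            prev = entry;
        }
        // Index exhausted: any remaining targets sort at or after the last entry.
        if (t < targets.size()) {
            keep(kept, prev);
        }
        return kept;
    }

    /** Appends the entry unless it is null or already the last one kept. */
    private static void keep(List<String> kept, String entry) {
        if (entry == null) {
            return; // target sorts before the first index entry
        }
        if (kept.isEmpty() || !kept.get(kept.size() - 1).equals(entry)) {
            kept.add(entry);
        }
    }

    public static void main(String[] args) {
        List<String> index = Arrays.asList("abba", "argon", "bear", "cat");
        List<String> wanted = Arrays.asList("cab", "apple");
        System.out.println(prefetch(index.iterator(), wanted)); // prints [abba, bear]
    }
}
```

Only the kept entries need to stay in memory; everything else read from the index can be
discarded as soon as the pass moves on.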

This is accomplished by factoring out the "index loading" aspects of TermInfosReader into
a new file, SegmentTermInfosReader.  TermInfosReader becomes a base class to allow access
to terms.  A new class, PrefetchedTermInfosReader, will, upon startup, sort the passed-in terms
and retrieve the IndexEntries for those terms.  IndexReader and SegmentReader are modified
to take new constructors that accept a Collection of Terms corresponding to the total
set of terms that will ever be searched in the life of the index.

In order to support the "skipping" behavior, some changes need to be made to SegmentTermEnum:
specifically, we need to be able to go back an entry in order to retrieve the previous TermInfo
and IndexPointer.  This is because, unlike the normal case, with the index we want to return
the value right before the intended term (so that we end up behind the desired term in the
main dictionary).  For example, if we're looking for "apple" in the index, and the two
adjacent values are "abba" and "argon", we want to return "abba" instead of "argon".  That
way we won't miss any terms in the real index.  This code is confusing; it should probably
be moved to a subclass of TermBuffer, but that required more code.  Not wanting to modify
TermBuffer, in order to keep it small, also led to the odd NPE catch in [...].  Sticklers
for contracts may want to rename SegmentTermEnum.skipTo() to a different name because it
implements a different contract, but it would be useful for anyone trying to skip around in
the TII, so I figured it was the right thing to do.
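
The "return the entry before the target" contract from the "apple" example can be sketched as
a floor lookup over a sorted list.  This is an illustration only: FloorLookup and floorEntry
are hypothetical names, Strings stand in for index entries, and treating an exact match as its
own floor is an assumption rather than something the patch is known to do.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the lookup contract, not the SegmentTermEnum code.
class FloorLookup {

    /**
     * Returns the last entry that sorts at or before the target, or null if
     * the target sorts before every entry.
     */
    static String floorEntry(List<String> sortedIndexTerms, String target) {
        int pos = Collections.binarySearch(sortedIndexTerms, target);
        if (pos >= 0) {
            return sortedIndexTerms.get(pos); // exact hit
        }
        int insertionPoint = -pos - 1; // index of the first entry greater than target
        return insertionPoint == 0 ? null : sortedIndexTerms.get(insertionPoint - 1);
    }

    public static void main(String[] args) {
        List<String> index = Arrays.asList("abba", "argon");
        System.out.println(floorEntry(index, "apple")); // prints abba
    }
}
```

Starting the scan of the main dictionary from "abba" rather than "argon" is what guarantees
that "apple", if present, is still ahead of the read position.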
