lucene-java-user mailing list archives

From Michael McCandless <>
Subject Re: Term Dictionary taking up lots of memory, looking for solutions, lucene 5.3.1
Date Thu, 18 May 2017 10:56:14 GMT
That sounds like a fun amount of terms!

Note that Lucene does not load all terms into memory; only the "prefix
trie", stored as an FST, mapping term prefixes to on-disk blocks of terms.
FSTs are very compact data structures, effectively implementing
SortedMap<String,T>, so it's surprising you need 65 G heap for the FSTs.
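
[For illustration, not part of the original message: the SortedMap-like behavior described above can be sketched with Lucene 5.x's org.apache.lucene.util.fst API. The terms and mapped values here are arbitrary examples; inputs must be added in sorted order.]

```java
import java.io.IOException;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.IntsRefBuilder;
import org.apache.lucene.util.fst.Builder;
import org.apache.lucene.util.fst.FST;
import org.apache.lucene.util.fst.PositiveIntOutputs;
import org.apache.lucene.util.fst.Util;

public class FstSketch {
  // Build a tiny FST mapping terms to long outputs and look one up.
  static long demo() throws IOException {
    PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();
    Builder<Long> builder = new Builder<>(FST.INPUT_TYPE.BYTE1, outputs);
    IntsRefBuilder scratch = new IntsRefBuilder();
    // Terms must be added in sorted (byte) order.
    builder.add(Util.toIntsRef(new BytesRef("cat"), scratch), 5L);
    builder.add(Util.toIntsRef(new BytesRef("dog"), scratch), 7L);
    FST<Long> fst = builder.finish();
    return Util.get(fst, new BytesRef("cat"));
  }

  public static void main(String[] args) throws IOException {
    System.out.println(demo()); // prints 5
  }
}
```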

Anyway, with the BlockTreeTermsWriter/Reader, the equivalent of the old
termInfosIndexDivisor is to change the allowed on-disk block size (defaults
to 25-48 terms per block) to something larger.  To do this, make your own
subclass of FilterCodec, passing the current default codec to wrap, and
override the postingsFormat method to return a "new
Lucene50PostingsFormat(...)" passing a larger min and max block size.  This
applies at indexing time, so you need to reindex to see your FSTs get
smaller.

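[A minimal sketch of the FilterCodec subclass described above, not part of the original message. The class name and the block sizes 100/200 are arbitrary examples; the defaults are 25/48. Note the codec name is recorded per segment, so the class likely needs registering via SPI (META-INF/services/org.apache.lucene.codecs.Codec) for the index to be readable.]

```java
import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.codecs.lucene50.Lucene50PostingsFormat;

// Wraps the current default codec, overriding only the postings format
// to use larger term-dictionary blocks (fewer, bigger FST entries).
public class LargeBlockCodec extends FilterCodec {
  private final PostingsFormat postings =
      new Lucene50PostingsFormat(100, 200); // min/max terms per block

  public LargeBlockCodec() {
    super("LargeBlockCodec", Codec.getDefault());
  }

  @Override
  public PostingsFormat postingsFormat() {
    return postings;
  }
}
```

Then set it at indexing time, e.g. via IndexWriterConfig.setCodec(new LargeBlockCodec()), and reindex.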
Mike McCandless

On Wed, May 17, 2017 at 5:26 PM, Tom Hirschfeld <> wrote:

> Hey!
> I am working on a Lucene-based service for reverse geocoding. We have a
> large index with lots of unique terms (550 million) and it appears that
> we're running into issues with memory on our leaf servers, as the term
> dictionary for the entire index is being loaded into heap space. If we
> allocate > 65g heap space, our queries return relatively quickly (10s-100s
> of ms), but if we drop below ~65g heap space on the leaf nodes, query time
> degrades dramatically, quickly hitting 20+ seconds (our test harness cuts
> off at 20s).
> I did some research, and found that in past versions of Lucene one could
> split the loading of the terms dictionary using the 'termInfosIndexDivisor'
> option in the DirectoryReader class. That option was deprecated in Lucene
> 5.0.0 in favor of using codecs to achieve similar functionality. Looking at
> the available experimental codecs, I see the BlockTreeTermsWriter
> <lucene/codecs/blocktree/BlockTreeTermsWriter.html#
> BlockTreeTermsWriter(org.apache.lucene.index.SegmentWriteState,
> org.apache.lucene.codecs.PostingsWriterBase, int, int)> that seems like it
> could be used for a similar purpose, breaking down the term dictionary so
> that we don't load the whole thing into heap space.
> Has anyone run into this problem before and found an effective solution?
> Does changing the codec used seem appropriate for this issue? If so, how do
> I go about loading an alternative codec and configuring it to my needs?
> I'm having trouble finding docs/examples of how this is used in the real
> world so even if you point me to a repo or docs somewhere I'd appreciate
> it.
> Thanks!
> Best,
> Tom Hirschfeld
