lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Term Dictionary taking up lots of memory, looking for solutions, lucene 5.3.1
Date Thu, 18 May 2017 10:56:14 GMT
That sounds like a fun number of terms!

Note that Lucene does not load all terms into memory; it loads only the
"prefix trie", stored as an FST (
http://blog.mikemccandless.com/2010/12/using-finite-state-transducers-in.html),
mapping term prefixes to on-disk blocks of terms.  FSTs are very compact
data structures, effectively implementing a SortedMap<String,T>, so it's
surprising you need 65 GB of heap for the FSTs.
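
To make the SortedMap analogy concrete, here's a rough sketch against the
5.x org.apache.lucene.util.fst API (illustrative only; this is not the
exact code the terms index uses):

  import org.apache.lucene.util.BytesRef;
  import org.apache.lucene.util.IntsRefBuilder;
  import org.apache.lucene.util.fst.Builder;
  import org.apache.lucene.util.fst.FST;
  import org.apache.lucene.util.fst.PositiveIntOutputs;
  import org.apache.lucene.util.fst.Util;

  // Build a tiny FST mapping terms to long values; like a SortedMap,
  // the inputs must be added in sorted order.
  PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();
  Builder<Long> builder = new Builder<>(FST.INPUT_TYPE.BYTE1, outputs);
  IntsRefBuilder scratch = new IntsRefBuilder();
  builder.add(Util.toIntsRef(new BytesRef("cat"), scratch), 5L);
  builder.add(Util.toIntsRef(new BytesRef("dog"), scratch), 7L);
  FST<Long> fst = builder.finish();

  Long value = Util.get(fst, new BytesRef("dog"));  // returns 7

Shared prefixes and suffixes are compressed away, which is why the
per-term memory cost is so small.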

Anyway, with BlockTreeTermsWriter/Reader, the equivalent of the old
termInfosIndexDivisor is to change the allowed on-disk block size (the
default is 25-48 terms per block) to something larger.  To do this, make
your own subclass of FilterCodec, passing the current default codec to
wrap, and override the postingsFormat method to return a new
Lucene50PostingsFormat(...) with a larger min and max block size.  This
applies at indexing time, so you need to reindex to see your FSTs get
smaller.
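
Roughly like this (untested sketch; the 100/200 block sizes are just
placeholders to tune, and the writer requires max >= 2*(min-1)):

  import org.apache.lucene.codecs.Codec;
  import org.apache.lucene.codecs.FilterCodec;
  import org.apache.lucene.codecs.PostingsFormat;
  import org.apache.lucene.codecs.lucene50.Lucene50PostingsFormat;

  // Wraps the default codec but writes terms into larger on-disk
  // blocks, so the in-heap FST index has fewer entries.
  public class LargeBlockCodec extends FilterCodec {

    private final PostingsFormat postings =
        new Lucene50PostingsFormat(100, 200);

    public LargeBlockCodec() {
      // This name is recorded in the index, so the class must also be
      // registered via SPI (META-INF/services/org.apache.lucene.codecs.Codec)
      // for readers to find it later.
      super("LargeBlockCodec", Codec.getDefault());
    }

    @Override
    public PostingsFormat postingsFormat() {
      return postings;
    }
  }

Then install it at index time:

  IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
  iwc.setCodec(new LargeBlockCodec());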

Mike McCandless

http://blog.mikemccandless.com

On Wed, May 17, 2017 at 5:26 PM, Tom Hirschfeld <tomhirschfeld@gmail.com>
wrote:

> Hey!
>
> I am working on a Lucene-based service for reverse geocoding. We have a
> large index with lots of unique terms (550 million), and it appears that
> we're running into memory issues on our leaf servers because the term
> dictionary for the entire index is being loaded into heap space. If we
> allocate > 65g heap space, our queries return relatively quickly (tens to
> hundreds of ms), but if we drop below ~65g heap space on the leaf nodes,
> query time degrades dramatically, quickly hitting 20+ seconds (our test
> harness times out at 20s).
>
> I did some research and found that in past versions of Lucene, one could
> split the loading of the terms dictionary using the 'termInfosIndexDivisor'
> option on the DirectoryReader class. That option was deprecated in Lucene
> 5.0.0
> <https://abi-laboratory.pro/java/tracker/changelog/lucene/5.0.0/log.html>
> in
> favor of using codecs to achieve similar functionality. Looking at the
> available experimental codecs, I see the BlockTreeTermsWriter
> <https://lucene.apache.org/core/5_3_1/core/org/apache/lucene/codecs/blocktree/BlockTreeTermsWriter.html#BlockTreeTermsWriter(org.apache.lucene.index.SegmentWriteState,%20org.apache.lucene.codecs.PostingsWriterBase,%20int,%20int)>
> that seems like it could be used for a similar purpose, breaking down the
> term dictionary so that we don't load the whole thing into heap space.
>
> Has anyone run into this problem before and found an effective solution?
> Does changing the codec seem appropriate for this issue? If so, how do
> I go about loading an alternative codec and configuring it to my needs?
> I'm having trouble finding docs/examples of how this is used in the real
> world, so even if you just point me to a repo or docs somewhere, I'd
> appreciate it.
> Thanks!
>
> Best,
> Tom Hirschfeld
>
