lucene-java-user mailing list archives

From Tom Hirschfeld <tomhirschf...@gmail.com>
Subject Term Dictionary taking up lots of memory, looking for solutions, lucene 5.3.1
Date Thu, 18 May 2017 00:26:19 GMT
Hey!

I am working on a Lucene-based service for reverse geocoding. We have a
large index with lots of unique terms (550 million), and it appears that
we're running into memory issues on our leaf servers because the term
dictionary for the entire index is being loaded into heap space. If we
allocate > 65g of heap space, our queries return relatively quickly (10s-100s
of ms), but if we drop below ~65g of heap space on the leaf nodes, query
performance degrades dramatically, with latency quickly hitting 20+ seconds
(our test harness drops requests at 20s).

I did some research and found that in past versions of Lucene, one could split
the loading of the terms dictionary using the 'termInfosIndexDivisor'
option in the DirectoryReader class. That option was deprecated in Lucene
5.0.0
<https://abi-laboratory.pro/java/tracker/changelog/lucene/5.0.0/log.html> in
favor of using codecs to achieve similar functionality. Looking at the
available experimental codecs, I see the BlockTreeTermsWriter
<https://lucene.apache.org/core/5_3_1/core/org/apache/lucene/codecs/blocktree/BlockTreeTermsWriter.html#BlockTreeTermsWriter(org.apache.lucene.index.SegmentWriteState,
org.apache.lucene.codecs.PostingsWriterBase, int, int)> that seems like it
could be used for a similar purpose, breaking down the term dictionary so
that we don't load the whole thing into heap space.

Has anyone run into this problem before and found an effective solution?
Does changing the codec seem appropriate for this issue? If so, how do
I go about loading an alternative codec and configuring it to my needs?
I'm having trouble finding docs or examples of how this is used in the real
world, so even if you just point me to a repo or docs somewhere, I'd
appreciate it.
Thanks!

Best,
Tom Hirschfeld
