lucene-java-user mailing list archives

From Tom Hirschfeld <>
Subject Re: Term Dictionary taking up lots of memory, looking for solutions, lucene 5.3.1
Date Wed, 14 Jun 2017 03:12:29 GMT
Hey All,

I was able to solve my problem a few weeks ago and wanted to update you
all. The root issue was the caching mechanism in the
makeDistanceValueSource method of the Lucene spatial module: documents
were being pulled into the cache and never expired. To address this, we
upgraded our application to Lucene 6.5.1 and used LatLonDocValuesField
for indexing/searching. Heap use is back down to ~500 MB for the whole
app under load, and each node can support about 5k QPS at a p95 of 9 ms,
which is a great improvement over the RPT strategy we had been using.
Once again, thanks for your help.
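For anyone landing on this thread later, the change was roughly the
following. This is a minimal sketch against Lucene 6.5.x; the field name
"location", the variable names, and the 5 km radius are illustrative, not
taken from the actual application:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.LatLonDocValuesField;
import org.apache.lucene.document.LatLonPoint;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.TopDocs;

// At indexing time: index the same coordinates twice under one name.
Document doc = new Document();
doc.add(new LatLonPoint("location", lat, lon));          // fast spatial filtering
doc.add(new LatLonDocValuesField("location", lat, lon)); // columnar storage for distance sort

// At query time (reverse geocoding): filter to a radius, sort by distance,
// keep the single nearest hit.
Sort byDistance = new Sort(
    LatLonDocValuesField.newDistanceSort("location", queryLat, queryLon));
TopDocs nearest = searcher.search(
    LatLonPoint.newDistanceQuery("location", queryLat, queryLon, 5_000), // meters
    1, byDistance);
```

The point field handles filtering while the doc-values field handles the
distance sort, so nothing needs to be cached per document on the heap.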

Tom Hirschfeld

On Thu, May 18, 2017 at 4:22 AM, Uwe Schindler <> wrote:

> Hi,
> Are you sure that the term index is the problem? Even with huge indexes
> you never need 65 gigs of heap! That's impossible.
> Are you sure that your problem is not something else?:
> - Too large a heap? Heaps greater than 31 gigs are bad by default (the
> JVM loses compressed object pointers). Lucene needs only a little heap,
> even if you have large indexes with many terms! You can easily run a
> query on a 100 gig index with less than 4 gigs of heap. The memory used
> by Lucene is filesystem cache through MMapDirectory, so you need lots of
> that free, not heap space. Too-large heaps are counterproductive.
> - Could it be that you try to sort on one of those fields without
> DocValues enabled? Then everything is loaded into heap and you are in
> trouble.
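For reference, a minimal sketch of what "DocValues enabled" means at
indexing time, so a sort reads a compact on-disk column instead of
un-inverting the field onto the heap (field name and value are
illustrative):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

Document doc = new Document();
doc.add(new StoredField("price", 42L));            // for retrieval in results
doc.add(new NumericDocValuesField("price", 42L));  // for sorting/faceting

// Later, sorting uses the doc-values column:
Sort sort = new Sort(new SortField("price", SortField.Type.LONG));
```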
> FYI, since Lucene 5 you can get the heap usage of many Lucene components
> using the Accountable interface. E.g., just call ramBytesUsed() on your
> IndexReader. You can also dive into all components starting from the
> IndexReader at the top level to see which one is using the heap. Just get
> the whole output of the tree as a hierarchical printout using the
> Accountable interface.
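A sketch of what Uwe describes, assuming Lucene 5.x or later: the
per-segment readers implement Accountable, and the Accountables utility
class can render the resource tree. The cast assumes an ordinary on-disk
index whose leaves are SegmentReaders:

```java
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.SegmentReader;
import org.apache.lucene.util.Accountables;

// reader is an open DirectoryReader on the index
for (LeafReaderContext ctx : reader.leaves()) {
    SegmentReader sr = (SegmentReader) ctx.reader();
    System.out.println("segment heap use: " + sr.ramBytesUsed() + " bytes");
    // Walks getChildResources() recursively and prints an indented tree,
    // breaking the total down into terms index (FST), norms, doc values, etc.
    System.out.println(Accountables.toString(sr));
}
```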
> We need more information to help you.
> Uwe
> On 18 May 2017 at 12:56:14 CEST, Michael McCandless <
>> That sounds like a fun amount of terms!
>> Note that Lucene does not load all terms into memory; only the "prefix
>> trie", stored as an FST (a finite state transducer) mapping term prefixes
>> to on-disk blocks of terms.  FSTs are very compact data structures,
>> effectively implementing SortedMap<String,T>, so it's surprising you need
>> 65 G heap for the FSTs.
>> Anyway, with the BlockTreeTermsWriter/Reader, the equivalent of the old
>> termInfosIndexDivisor is to change the allowed on-disk block size (defaults
>> to 25 - 48 terms per block) to something larger.  To do this, make your own
>> subclass of FilterCodec, passing the current default codec to wrap, and
>> override the postingsFormat method to return a "new
>> Lucene50PostingsFormat(...)" passing a larger min and max block size.  This
>> applies at indexing time, so you need to reindex to see your FSTs get
>> smaller.
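A sketch of Mike's suggestion, assuming Lucene 5.x where
Lucene50PostingsFormat is the default postings format; the codec name and
the 100/200 block sizes are illustrative, not recommendations:

```java
import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.codecs.lucene50.Lucene50PostingsFormat;

public class LargeBlockCodec extends FilterCodec {
    // Larger min/max terms per on-disk block -> fewer blocks -> smaller FST.
    private final PostingsFormat postings = new Lucene50PostingsFormat(100, 200);

    public LargeBlockCodec() {
        // Wrap the current default codec; everything except the postings
        // format is delegated to it unchanged.
        super("LargeBlockCodec", Codec.getDefault());
    }

    @Override
    public PostingsFormat postingsFormat() {
        return postings;
    }
}
```

Set it with IndexWriterConfig.setCodec(new LargeBlockCodec()) and reindex.
Note the codec name must also be registered via SPI
(META-INF/services/org.apache.lucene.codecs.Codec) so the index can be
read back later.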
>> Mike McCandless
>> On Wed, May 17, 2017 at 5:26 PM, Tom Hirschfeld <>
>> wrote:
>>  Hey!
>>>  I am working on a Lucene-based service for reverse geocoding. We have a
>>>  large index with lots of unique terms (550 million), and it appears that
>>>  we're running into memory issues on our leaf servers, as the term
>>>  dictionary for the entire index is being loaded into heap space. If we
>>>  allocate > 65g of heap space, our queries return relatively quickly (10s
>>>  to 100s of ms), but if we drop below ~65g of heap space on the leaf
>>>  nodes, query time degrades dramatically, quickly exceeding 20 seconds
>>>  (our test harness gives up at 20s).
>>>  I did some research and found that in past versions of Lucene, one could
>>>  split the loading of the terms dictionary using the 'termInfosIndexDivisor'
>>>  option in the DirectoryReader class. That option was deprecated in Lucene
>>>  5.0.0
>>>  <>
>>>  in favor of using codecs to achieve similar functionality. Looking at the
>>>  available experimental codecs, I see the BlockTreeTermsWriter
>>>  <
>>>  lucene/codecs/blocktree/BlockTreeTermsWriter.html#
>>>  BlockTreeTermsWriter(org.apache.lucene.index.SegmentWriteState,
>>>  org.apache.lucene.codecs.PostingsWriterBase, int, int)>, which seems like
>>>  it could be used for a similar purpose, breaking the term dictionary down
>>>  so that we don't load the whole thing into heap space.
>>>  Has anyone run into this problem before and found an effective solution?
>>>  Does changing the codec seem appropriate for this issue? If so, how do I
>>>  go about loading an alternative codec and configuring it to my needs? I'm
>>>  having trouble finding docs/examples of how this is used in the real
>>>  world, so even a pointer to a repo or docs somewhere would be
>>>  appreciated.
>>>  Thanks!
>>>  Best,
>>>  Tom Hirschfeld
> --
> Uwe Schindler
> Achterdiek 19, 28357 Bremen
