lucene-java-user mailing list archives

From Uwe Schindler <...@thetaphi.de>
Subject Re: Term Dictionary taking up lots of memory, looking for solutions, lucene 5.3.1
Date Thu, 18 May 2017 11:22:15 GMT
Hi,
Are you sure that the term index is the problem? Even with huge indexes you never need 65
gigs of heap! That's impossible.
Are you sure that your problem is not something else?

- Too large a heap? Heaps greater than 31 gigs are bad by default. Lucene needs only a little heap,
even if you have large indexes with many terms! You can easily run a query on a 100 gig index
with less than 4 gigs of heap. The memory used by Lucene is filesystem cache through MMapDirectory,
so you need lots of that free, not heap space. Heaps that are too large are counterproductive.

- Could it be that you try to sort on one of those fields and you don't have DocValues enabled?
Then it loads everything into heap and you are in trouble.
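
To check the first point quickly, a minimal sketch along these lines can report the heap the JVM
was actually started with. The class name HeapCheck and the method isHeapTooLarge are made up for
illustration, and the 31 gig threshold is simply the figure from the paragraph above, not a JVM
constant (presumably it relates to HotSpot no longer using compressed object pointers at around
32 GB of heap).

```java
public class HeapCheck {
  // Threshold taken from the message above ("31 gigs"); an assumption, not a JVM constant.
  public static final long THRESHOLD_BYTES = 31L * 1024 * 1024 * 1024;

  /** Returns true if the given max heap size exceeds the 31 gig rule of thumb. */
  public static boolean isHeapTooLarge(long maxHeapBytes) {
    return maxHeapBytes > THRESHOLD_BYTES;
  }

  public static void main(String[] args) {
    // maxMemory() reflects -Xmx (or the JVM default if -Xmx is not set).
    long maxHeap = Runtime.getRuntime().maxMemory();
    System.out.printf("Max heap: %.1f GiB%n", maxHeap / (1024.0 * 1024 * 1024));
    if (isHeapTooLarge(maxHeap)) {
      System.out.println("Heap is above 31 gigs; consider a smaller heap and more filesystem cache.");
    }
  }
}
```

Run it with the same JVM options as the search service so the reported value matches production.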

FYI, since Lucene 5 you can get the heap usage of many Lucene components using the Accountable
interface. E.g., just call ramBytesUsed() on your IndexReader. You can also dive into all
components starting from the IndexReader at top level to see which one is using the heap.
Just get the whole output of the tree as a hierarchical printout using the Accountable interface.
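
A rough sketch of such a printout, written against Lucene 5.x, could look like the following. The
class name HeapReport and the method report are made up for illustration. It assumes that the
per-segment readers returned by leaves() are the ones implementing Accountable (hence the
instanceof check), and that the index path is passed as the first command-line argument.

```java
import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Accountable;
import org.apache.lucene.util.Accountables;
import org.apache.lucene.util.RamUsageEstimator;

public class HeapReport {

  /** Builds a hierarchical heap usage report over all segments of the reader. */
  public static String report(DirectoryReader reader) {
    StringBuilder sb = new StringBuilder();
    long total = 0;
    for (LeafReaderContext ctx : reader.leaves()) {
      LeafReader leaf = ctx.reader();
      // Assumption: only readers that implement Accountable can be measured.
      if (leaf instanceof Accountable) {
        Accountable accountable = (Accountable) leaf;
        total += accountable.ramBytesUsed();
        // Accountables.toString prints the resource and its children as a tree.
        sb.append(Accountables.toString(accountable)).append('\n');
      }
    }
    sb.append("total: ").append(RamUsageEstimator.humanReadableUnits(total));
    return sb.toString();
  }

  public static void main(String[] args) throws IOException {
    try (Directory dir = FSDirectory.open(Paths.get(args[0]));
         DirectoryReader reader = DirectoryReader.open(dir)) {
      System.out.println(report(reader));
    }
  }
}
```

If the terms index really is the problem, the entries for the postings and terms readers should
dominate the tree; if they are small, the heap is going somewhere else.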

We need more information to help you.
Uwe

On 18 May 2017 12:56:14 CEST, Michael McCandless <lucene@mikemccandless.com> wrote:
>That sounds like a fun amount of terms!
>
>Note that Lucene does not load all terms into memory; only the "prefix
>trie", stored as an FST
>(http://blog.mikemccandless.com/2010/12/using-finite-state-transducers-in.html),
>mapping term prefixes to on-disk blocks of terms.  FSTs are very compact
>data structures, effectively implementing SortedMap<String,T>, so it's
>surprising you need 65 G heap for the FSTs.
>
>Anyway, with the BlockTreeTermsWriter/Reader, the equivalent of the old
>termInfosIndexDivisor is to change the allowed on-disk block size (defaults
>to 25 - 48 terms per block) to something larger.  To do this, make your own
>subclass of FilterCodec, passing the current default codec to wrap, and
>override the postingsFormat method to return a "new
>Lucene50PostingsFormat(...)" passing a larger min and max block size.  This
>applies at indexing time, so you need to reindex to see your FSTs get
>smaller.
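
A minimal sketch of the codec Mike describes, written against Lucene 5.3.x, might look like this.
The class name LargeBlockCodec and the block sizes 200/398 are arbitrary illustrations, not
recommended values. The sketch wraps Lucene53Codec (the default codec in 5.3.x) explicitly rather
than asking for the default at construction time, which is an assumption made here to keep the
no-arg constructor safe for SPI loading.

```java
import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.codecs.lucene50.Lucene50PostingsFormat;
import org.apache.lucene.codecs.lucene53.Lucene53Codec;

public final class LargeBlockCodec extends FilterCodec {

  // Illustrative values only; the defaults are 25 and 48 terms per block.
  // The block tree writer requires max >= 2 * (min - 1).
  private static final int MIN_BLOCK_SIZE = 200;
  private static final int MAX_BLOCK_SIZE = 398;

  private final PostingsFormat postings =
      new Lucene50PostingsFormat(MIN_BLOCK_SIZE, MAX_BLOCK_SIZE);

  /** No-arg constructor so the codec can be looked up by name when reading segments. */
  public LargeBlockCodec() {
    super("LargeBlockCodec", new Lucene53Codec());
  }

  @Override
  public PostingsFormat postingsFormat() {
    // Larger blocks mean fewer blocks, so fewer prefixes end up in the in-heap FST.
    return postings;
  }
}
```

At indexing time this would be set with indexWriterConfig.setCodec(new LargeBlockCodec()). Because
the codec name is recorded in each segment, the class also has to be listed in
META-INF/services/org.apache.lucene.codecs.Codec on the classpath of every process that reads the
index. Larger blocks trade heap for more scanning per term lookup, so it is worth measuring query
latency after the reindex.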
>
>Mike McCandless
>
>http://blog.mikemccandless.com
>
>On Wed, May 17, 2017 at 5:26 PM, Tom Hirschfeld <tomhirschfeld@gmail.com> wrote:
>
>> Hey!
>>
>> I am working on a lucene based service for reverse geocoding. We have a
>> large index with lots of unique terms (550 million) and it appears that
>> we're running into issues with memory on our leaf servers as the term
>> dictionary for the entire index is being loaded into heap space. If we
>> allocate > 65g heap space, our queries return relatively quickly (10s-100s
>> of ms), but if we drop below ~65g heap space on the leaf nodes, query
>> performance drops dramatically, quickly hitting 20+ seconds (our test
>> harness drops at 20s).
>>
>> I did some research, and found in past versions of lucene, one could split
>> the loading of the terms dictionary using the 'termInfosIndexDivisor'
>> option in the DirectoryReader class. That option was deprecated in lucene
>> 5.0.0
>> <https://abi-laboratory.pro/java/tracker/changelog/lucene/5.0.0/log.html>
>> in favor of using codecs to achieve similar functionality. Looking at the
>> available experimental codecs, I see the BlockTreeTermsWriter
>> <https://lucene.apache.org/core/5_3_1/core/org/apache/lucene/codecs/blocktree/BlockTreeTermsWriter.html#BlockTreeTermsWriter(org.apache.lucene.index.SegmentWriteState, org.apache.lucene.codecs.PostingsWriterBase, int, int)>
>> that seems like it could be used for a similar purpose, breaking down the
>> term dictionary so that we don't load the whole thing into heap space.
>>
>> Has anyone run into this problem before and found an effective solution?
>> Does changing the codec used seem appropriate for this issue? If so, how do
>> I go about loading an alternative codec and configuring it to my needs?
>> I'm having trouble finding docs/examples of how this is used in the real
>> world, so even if you point me to a repo or docs somewhere I'd appreciate
>> it.
>> Thanks!
>>
>> Best,
>> Tom Hirschfeld
>>

--
Uwe Schindler
Achterdiek 19, 28357 Bremen
https://www.thetaphi.de