lucene-general mailing list archives

From siddharth teotia <siddharthteo...@gmail.com>
Subject Re: Memory usage
Date Mon, 11 Nov 2019 19:18:30 GMT
Hi Michael

Can you or someone from the community please help answer my questions?

Thanks
Siddharth

On Thu, Nov 7, 2019 at 7:50 AM siddharth teotia <siddharthteotia@gmail.com>
wrote:

> Hi Michael
>
> Thanks a lot for your response. A couple more questions:
>
> (1) During indexing, is there any knob to tell the writer to use off-heap
> memory for buffering? I didn't find anything in the docs, so the answer is
> probably no. Just confirming.
>
> (2) In my experiments, I have ingested up to 5 million documents into the
> Lucene index, and only 1 segment was created. The writer was committed and
> closed after ingesting all the documents, and we have no need to index
> more after that, so it is essentially an immutable index. I wanted to find
> the threshold for creating a new segment. Is it pretty high? Or, if the
> writer is reopened, will the next set of documents go into a new segment,
> and so on? I am asking in order to find the total number of files (per
> index) that will be opened during querying. So far, since there was a
> single segment, only that segment's cfs file was opened.
>
> Thanks
> Siddharth
>
> On Thu, Nov 7, 2019, 6:39 AM Michael McCandless <lucene@mikemccandless.com>
> wrote:
>
>> Hi Siddharth,
>>
>> Your understanding of MMapDirectory is correct -- only give your JVM
>> enough heap to not spend too much CPU on GC, and then let the OS use all
>> available remaining RAM to cache hot pages from your index.
>>
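
For readers of the archive, the memory-mapping behaviour discussed in the
paragraph above can be sketched with plain java.nio, without Lucene. This is
only an illustration of the underlying OS mechanism, not Lucene's
MMapDirectory code; the class name MmapSketch and the helper mapReadOnly are
hypothetical names made up for this example.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapSketch {

    // Hypothetical helper: maps the whole file read-only.
    // The mapped bytes are served from the OS page cache, not the Java heap.
    public static MappedByteBuffer mapReadOnly(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping stays valid after the channel is closed.
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("mmap-sketch", ".bin");
        tmp.toFile().deleteOnExit();
        Files.write(tmp, "hello".getBytes(StandardCharsets.US_ASCII));

        MappedByteBuffer buf = mapReadOnly(tmp);
        // A mapped buffer is a direct (off-heap) buffer.
        System.out.println("direct buffer: " + buf.isDirect());
        // Pages are faulted in lazily by the OS when they are first touched.
        System.out.println("first byte: " + (char) buf.get(0));
    }
}
```

Only the small buffer object lives on the Java heap; the file contents are
paged in and out by the OS, which is why leaving RAM free for the file system
cache matters more than a large heap in this setup.
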
>> There are some structures Lucene loads into the JVM heap, but even those
>> have recently been moving off-heap (accessed via Directory), such as the
>> FSTs used for the terms index and the BKD index (for dimensional points).
>> I'm not sure exactly which structures are still on heap ... maybe the live
>> documents bitset?
>>
>> During indexing, the recently indexed documents are buffered in JVM heap
>> until the limit set by IndexWriterConfig.setRAMBufferSizeMB is reached,
>> and then they are written to the Directory as new segments.
>>
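
The buffer-then-flush behaviour described in the paragraph above can be
pictured with a toy model. This is a sketch under simplifying assumptions and
not Lucene's implementation: the class FlushThresholdSketch, its methods, and
the byte-counting rule are all hypothetical, and segment merging is ignored
entirely.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class FlushThresholdSketch {
    private final long maxBufferedBytes;                 // stands in for the RAM buffer limit
    private final List<String> buffer = new ArrayList<>();
    private final List<List<String>> segments = new ArrayList<>();
    private long bufferedBytes = 0;

    public FlushThresholdSketch(long maxBufferedBytes) {
        this.maxBufferedBytes = maxBufferedBytes;
    }

    // Buffers the document on heap; flushes once the threshold is reached.
    public void addDocument(String doc) {
        buffer.add(doc);
        bufferedBytes += doc.getBytes(StandardCharsets.UTF_8).length;
        if (bufferedBytes >= maxBufferedBytes) {
            flush();
        }
    }

    // Writes everything buffered so far as one new "segment".
    public void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        segments.add(new ArrayList<>(buffer));
        buffer.clear();
        bufferedBytes = 0;
    }

    public int segmentCount() {
        return segments.size();
    }

    public static void main(String[] args) {
        FlushThresholdSketch writer = new FlushThresholdSketch(10);
        for (int i = 0; i < 5; i++) {
            writer.addDocument("aaaa");   // 4 bytes each
        }
        writer.flush();                   // final flush, as a commit or close would do
        System.out.println("segments: " + writer.segmentCount());
    }
}
```

In this model the number of segments depends only on the threshold and the
amount of data buffered; a real index can end up with a different count
because segments may also be merged after they are written.
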
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Wed, Nov 6, 2019 at 11:27 PM siddharth teotia <
>> siddharthteotia@gmail.com> wrote:
>>
>>> Hi All
>>>
>>> I have some questions about memory usage. I would really appreciate it
>>> if someone could help answer these.
>>>
>>> I understand from the docs that during reading/querying, Lucene uses
>>> MMapDirectory (assuming it is supported on the platform). So the Java
>>> heap overhead in this case will come purely from the objects that are
>>> allocated/instantiated on the query path to process the query, build
>>> results, etc. But the whole index itself will not be loaded into memory
>>> because the files are memory mapped. Is my understanding correct? In
>>> this case, we are better off not increasing the Java heap and keeping as
>>> much memory as possible available for the file system cache so that mmap
>>> can do its job efficiently.
>>>
>>> However, are there any portions of the index structures that are
>>> completely loaded in memory regardless of whether MMapDirectory is used
>>> or not? If so, are they loaded on the Java heap, or do we use off-heap
>>> memory (direct buffers) in such cases?
>>>
>>> Secondly, on the write path, I think that even though the writer opens
>>> an MMapDirectory, the writes are gathered/buffered in memory up to a
>>> flush threshold controlled by IndexWriterConfig. Is this buffering done
>>> in Java heap or direct memory?
>>>
>>> Thanks a lot for the help
>>> Siddharth
>>>
>>

-- 
*Best Regards,*
*SIDDHARTH TEOTIA*
*2008C6PS540G*
*BITS PILANI- GOA CAMPUS*

*+91 87911 75932*
