lucene-dev mailing list archives

From "Shai Erera (JIRA)" <>
Subject [jira] [Commented] (LUCENE-5248) Improve the data structure used in ReaderAndLiveDocs to hold the updates
Date Wed, 02 Oct 2013 06:10:25 GMT


Shai Erera commented on LUCENE-5248:

Forgot to mention: I tried to use RamUsageEstimator to detect early when the size of
IndexWriter.readerPool grows continuously. This doesn't work, I think because of the circular
reference SegReader -> SegCoreReaders -> SegReader:
* Even though I see sizeOf(RALD) drop whenever writeLiveDocs is called (because the map is
cleared), it still grows continuously because of its reader.
* I debugged this and saw that SegReader only references the DVProducers it needs (the new
gen'd ones); I didn't spot any potential memory leak.
* SegCoreReaders accounts for most of the size of SegReader, but everything looks
OK w/ SegCoreReaders.
* I think I read somewhere that RUE is not good at measuring objects with circular references?

If you think that RUE can be used to detect this continuous growth, then there is a potential
memory leak between RALD, SegReader and SegCoreReaders, and I will get to the bottom of it.
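
On the circular-reference question, one common way for a reflective object-graph walker to cope with cycles is to track visited objects by identity, so that each object is counted once. The following is only a minimal sketch of that idea and is not a copy of RamUsageEstimator: FakeReader and FakeCore are hypothetical stand-ins for a reader and its core, and the walker counts reachable objects instead of estimating bytes.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

// Hypothetical stand-ins: a reader and a core that reference each other.
class FakeCore { FakeReader owner; long[] data = new long[4]; }
class FakeReader { FakeCore core; }

class CycleSafeWalker {
  /** Counts the objects reachable from root, visiting each identity once. */
  static int countReachable(Object root) {
    // Identity-based set: two objects are "the same" only if they are ==.
    Set<Object> seen = Collections.newSetFromMap(new IdentityHashMap<>());
    ArrayDeque<Object> stack = new ArrayDeque<>();
    stack.push(root);
    while (!stack.isEmpty()) {
      Object o = stack.pop();
      if (!seen.add(o)) continue;  // already visited, so a cycle ends here
      Class<?> c = o.getClass();
      if (c.isArray()) continue;   // simplification: arrays are counted, not descended into
      for (; c != null && c != Object.class; c = c.getSuperclass()) {
        for (Field f : c.getDeclaredFields()) {
          if (Modifier.isStatic(f.getModifiers()) || f.getType().isPrimitive()) continue;
          try {
            f.setAccessible(true);
            Object v = f.get(o);
            if (v != null) stack.push(v);
          } catch (IllegalAccessException e) {
            throw new IllegalStateException(e);
          }
        }
      }
    }
    return seen.size();
  }
}
```

With r.core = c and c.owner = r, countReachable(r) visits r, c and c.data once each and then terminates. A walker without the identity set would loop forever on the same graph, so an estimator that finishes on a cyclic graph must already be de-duplicating in some way.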

> Improve the data structure used in ReaderAndLiveDocs to hold the updates
> ------------------------------------------------------------------------
>                 Key: LUCENE-5248
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Shai Erera
>            Assignee: Shai Erera
>         Attachments: LUCENE-5248.patch, LUCENE-5248.patch
> Currently ReaderAndLiveDocs holds the updates in two structures:
> +Map<String,Map<Integer,Long>>+
> Holds a mapping from each field to all docs that were updated and their values. This
> structure is updated when applyDeletes is called, and needs to satisfy several requirements:
> # Un-ordered writes: if a field "f" is updated by two terms, termA and termB, in that
> order, and termA affects doc=100 and termB doc=2, then the updates are applied in that order,
> meaning we cannot rely on updates arriving in doc order.
> # The same document may be updated multiple times, either by the same term (e.g. several calls
> to IW.updateNDV) or by different terms. Last update wins.
> # Sequential read: when writing the updates to the Directory (fieldsConsumer), we iterate
> on the docs in order and for each one check whether it's updated; if not, we pull its value from
> the current DV.
> # A single update may affect several million documents, so the structure needs to be efficient
> w.r.t. memory consumption.
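
The first three requirements above can be illustrated with a small sketch built on the Map<String,Map<Integer,Long>> shape described in the issue. The class FieldUpdates and its methods update and resolve are hypothetical names used only for illustration, and a long[] stands in for the current DV values of a field; this is not the code in ReaderAndLiveDocs.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

class FieldUpdates {
  // field -> (docID -> latest value), same shape as Map<String,Map<Integer,Long>>
  private final Map<String, Map<Integer, Long>> updates = new HashMap<>();

  /** Writes may arrive in any doc order; a later put for the same doc wins. */
  void update(String field, int doc, long value) {
    updates.computeIfAbsent(field, f -> new HashMap<>()).put(doc, value);
  }

  /** Sequential read: walk docs in order, preferring the update over the current value. */
  long[] resolve(String field, long[] currentValues) {
    Map<Integer, Long> fieldUpdates =
        updates.getOrDefault(field, Collections.<Integer, Long>emptyMap());
    long[] out = new long[currentValues.length];
    for (int doc = 0; doc < currentValues.length; doc++) {
      Long updated = fieldUpdates.get(doc);
      out[doc] = (updated != null) ? updated : currentValues[doc];
    }
    return out;
  }
}
```

The sketch does nothing for the fourth requirement: every entry boxes an Integer and a Long, which is the kind of per-document overhead that becomes costly when a single update touches millions of documents.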
> +Map<Integer,Map<String,Long>>+
> Holds a mapping from a document to all the fields in which it was updated and the updated
> value for each field. This is used by IW.commitMergedDeletes to apply the updates that came
> in while the segment was merging. The requirements this structure needs to satisfy are:
> # Access in doc order: this is how commitMergedDeletes works.
> # One-pass: we visit a document once (currently), so where possible it's better to
> know all the fields in which it was updated. The updates are applied to the merged ReaderAndLiveDocs
> (where they are stored in the first structure mentioned above).
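
The two requirements on the second structure can be sketched in the same way. The example below assumes a TreeMap keyed by docID to get doc-order access, which is a choice made for this illustration and not something the issue prescribes; the class MergingUpdates and its methods are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiConsumer;

class MergingUpdates {
  // docID -> (field -> value), same shape as Map<Integer,Map<String,Long>>;
  // the TreeMap keeps docIDs sorted (an assumption made for this sketch).
  private final TreeMap<Integer, Map<String, Long>> byDoc = new TreeMap<>();

  /** Records an update that arrived while the segment was merging. */
  void update(int doc, String field, long value) {
    byDoc.computeIfAbsent(doc, d -> new HashMap<>()).put(field, value);
  }

  /** One pass in doc order: each doc is visited once with all of its updated fields. */
  void forEachInDocOrder(BiConsumer<Integer, Map<String, Long>> visitor) {
    for (Map.Entry<Integer, Map<String, Long>> e : byDoc.entrySet()) {
      visitor.accept(e.getKey(), e.getValue());
    }
  }
}
```

A caller would pass a visitor that maps each docID to its position in the merged segment and applies all the fields for that document at once.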
> Comments with proposals will follow next.

This message was sent by Atlassian JIRA
