lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents
Date Tue, 03 Apr 2007 10:21:32 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486292
] 

Michael McCandless commented on LUCENE-843:
-------------------------------------------

Some details on how I measure RAM usage: both the baseline (current
lucene trunk) and my patch have two general classes of RAM usage.

The first class, "document processing RAM", is RAM used while
processing a single doc. This RAM is re-used for each document (in the
trunk, it's GC'd and new RAM is allocated; in my patch, I explicitly
re-use these objects) and how large it gets is driven by how big each
document is.

The second class, "indexed documents RAM", is the RAM used up by
previously indexed documents.  This RAM grows with each added
document and how large it gets is driven by the number and size of
docs indexed since the last flush.

So when I say the writer is allowed to use 32 MB of RAM, I'm only
measuring the "indexed documents RAM".  With trunk I do this by
calling ramSizeInBytes(), and with my patch I do the analagous thing
by measuring how many RAM buffers are held up storing previously
indexed documents.

I then define "RAM efficiency" (docs/MB) as how many docs we can hold
in "indexed documents RAM" per MB RAM, at the point that we flush to
disk.  I think this is an important metric because it drives how large
your initial (level 0) segments are.  The larger these segments are
then generally the less merging you need to do, for a given # docs in
the index.

I also measure overall RAM used in the JVM (using
MemoryMXBean.getHeapMemoryUsage().getUsed()) just prior to each flush
except the last, to also capture the "document processing RAM", object
overhead, etc.


> improve how IndexWriter uses RAM to buffer added documents
> ----------------------------------------------------------
>
>                 Key: LUCENE-843
>                 URL: https://issues.apache.org/jira/browse/LUCENE-843
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-843.patch, LUCENE-843.take2.patch, LUCENE-843.take3.patch,
LUCENE-843.take4.patch
>
>
> I'm working on a new class (MultiDocumentWriter) that writes more than
> one document directly into a single Lucene segment, more efficiently
> than the current approach.
> This only affects the creation of an initial segment from added
> documents.  I haven't changed anything after that, eg how segments are
> merged.
> The basic ideas are:
>   * Write stored fields and term vectors directly to disk (don't
>     use up RAM for these).
>   * Gather posting lists & term infos in RAM, but periodically do
>     in-RAM merges.  Once RAM is full, flush buffers to disk (and
>     merge them later when it's time to make a real segment).
>   * Recycle objects/buffers to reduce time/stress in GC.
>   * Other various optimizations.
> Some of these changes are similar to how KinoSearch builds a segment.
> But, I haven't made any changes to Lucene's file format nor added
> requirements for a global fields schema.
> So far the only externally visible change is a new method
> "setRAMBufferSize" in IndexWriter (and setMaxBufferedDocs is
> deprecated) so that it flushes according to RAM usage and not a fixed
> number documents added.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message