lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents
Date Mon, 02 Apr 2007 14:43:32 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-843:
--------------------------------------

    Attachment: LUCENE-843.take4.patch

Another rev of the patch.  All tests pass except disk full tests.  The
code is still rather "dirty" and not well commented.

I think I'm close to finishing optimizing and now I will focus on
error handling (eg disk full), adding some deeper unit tests, more
testing on corner cases like massive docs or docs with massive terms,
etc., flushing pending norms to disk, cleaning up / commenting the
code and various other smaller items.

Here are the changes in this rev:

  * A proposed backwards compatible change to the Token API to also
    allow the term text to be delivered as a slice (offset & length)
    into a char[] array instead of String.  With an analyzer/tokenizer
    that takes advantage of this, this was a decent performance gain
    in my local testing.  I've created a SimpleSpaceAnalyzer that only
    splits words at the space character to test this.

  * Added more asserts (run java -ea to enable asserts).  The asserts
    are quite useful and now often catch a bug I've introduced before
    the unit tests do.

  * Changed to custom int[] block buffering for postings to store
    freq, prox's and offsets.  With this buffering we no longer have
    to double the size of int[] arrays while adding positions, nor do
    we have to copy ints whenever we need more space for these
    arrays.  Instead I allocate larger slices out of the shared int[]
    arrays.  This reduces memory and improves performance.

  * Changed to custom char[] block buffering for postings to store
    term text.  This also reduces memory and improves performance.

  * Changed to single file for RAM & flushed partial segments (was 3
    separate files before)

  * Changed how I merge flushed partial segments to match what's
    described in LUCENE-854

  * Reduced memory usage when indexing large docs (25 MB plain text
    each).  I'm still consuming more RAM in this case than the
    baseline (trunk) so I'm still working on this one ...

  * Fixed a slow memory leak when building large (20+ GB) indices
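
To illustrate the first item above (term text delivered as an offset and
length into a char[] instead of a String), here is a minimal sketch.  It
is not the code from the patch: the class SpaceSliceTokenizer and the
TermConsumer interface are hypothetical names made up for this example,
and the real Token API change may look different.  It only splits on the
space character, as described for SimpleSpaceAnalyzer.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: split on ' ' and hand out terms as char[] slices. */
final class SpaceSliceTokenizer {

    /** Receives each term as a slice of the shared buffer (assumed interface). */
    interface TermConsumer {
        void term(char[] buffer, int offset, int length);
    }

    /** Scans text[0..textLength) once; no String is created per term. */
    static void tokenize(char[] text, int textLength, TermConsumer consumer) {
        int start = -1; // start of the term being scanned, or -1 if between terms
        for (int i = 0; i <= textLength; i++) {
            boolean boundary = (i == textLength) || text[i] == ' ';
            if (!boundary) {
                if (start < 0) {
                    start = i;
                }
            } else if (start >= 0) {
                consumer.term(text, start, i - start);
                start = -1;
            }
        }
    }

    /** Convenience for inspection only: materializes the slices as Strings. */
    static List<String> collect(String s) {
        char[] buf = s.toCharArray();
        List<String> out = new ArrayList<>();
        tokenize(buf, buf.length, (b, off, len) -> out.add(new String(b, off, len)));
        return out;
    }
}
```

A consumer that hashes or copies the slice directly into its own buffers
avoids the per-term String allocation; collect() exists only to make the
output easy to check.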

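The int[] block buffering described above can be sketched roughly as
follows.  This is an assumption-laden illustration, not the patch's
implementation: the names IntSlicePool and Writer, the block and slice
sizes, and the use of the last slot of each slice as a forward pointer
are all choices made for this example.  The point it shows is that a
posting that runs out of room gets a new, larger slice from the shared
blocks and nothing already written is copied.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of slice allocation out of shared int[] blocks. */
final class IntSlicePool {
    static final int BLOCK_SIZE = 8192;  // ints per shared block (assumed)
    static final int FIRST_SLICE = 4;    // size of a posting's first slice (assumed)
    static final int MAX_SLICE = 1024;   // cap on slice growth (assumed)

    private final List<int[]> blocks = new ArrayList<>();
    private int upto = BLOCK_SIZE;       // next free slot in the last block

    /** Reserves size contiguous ints in one block and returns their address. */
    int allocSlice(int size) {
        if (upto + size > BLOCK_SIZE) {  // does not fit: start a fresh block
            blocks.add(new int[BLOCK_SIZE]);
            upto = 0;
        }
        int address = (blocks.size() - 1) * BLOCK_SIZE + upto;
        upto += size;
        return address;
    }

    int get(int address) {
        return blocks.get(address / BLOCK_SIZE)[address % BLOCK_SIZE];
    }

    void set(int address, int value) {
        blocks.get(address / BLOCK_SIZE)[address % BLOCK_SIZE] = value;
    }

    int blockCount() {
        return blocks.size();
    }

    /** Appends ints for one posting; slices are chained, never copied. */
    static final class Writer {
        private final IntSlicePool pool;
        private final int head;          // address of the first slice
        private int sliceSize = FIRST_SLICE;
        private int sliceEnd;            // slot reserved for the forward pointer
        private int pos;                 // next address to write
        private int count;

        Writer(IntSlicePool pool) {
            this.pool = pool;
            head = pool.allocSlice(sliceSize);
            pos = head;
            sliceEnd = head + sliceSize - 1;
        }

        void add(int value) {
            if (pos == sliceEnd) {       // slice full: chain a larger one
                sliceSize = Math.min(sliceSize * 2, MAX_SLICE);
                int next = pool.allocSlice(sliceSize);
                pool.set(sliceEnd, next);
                pos = next;
                sliceEnd = next + sliceSize - 1;
            }
            pool.set(pos++, value);
            count++;
        }

        /** Follows the forward pointers to read the values back in order. */
        int[] toArray() {
            int[] out = new int[count];
            int size = FIRST_SLICE;
            int p = head;
            int end = head + size - 1;
            for (int i = 0; i < count; i++) {
                if (p == end) {
                    size = Math.min(size * 2, MAX_SLICE);
                    p = pool.get(end);
                    end = p + size - 1;
                }
                out[i] = pool.get(p++);
            }
            return out;
        }
    }
}
```

Several Writers can share one pool, so their slices interleave inside
the same int[] blocks.  The cost is one int per slice for the forward
pointer plus any unused tail at the end of a block.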


> improve how IndexWriter uses RAM to buffer added documents
> ----------------------------------------------------------
>
>                 Key: LUCENE-843
>                 URL: https://issues.apache.org/jira/browse/LUCENE-843
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-843.patch, LUCENE-843.take2.patch, LUCENE-843.take3.patch,
>                      LUCENE-843.take4.patch
>
>
> I'm working on a new class (MultiDocumentWriter) that writes more than
> one document directly into a single Lucene segment, more efficiently
> than the current approach.
> This only affects the creation of an initial segment from added
> documents.  I haven't changed anything after that, eg how segments are
> merged.
> The basic ideas are:
>   * Write stored fields and term vectors directly to disk (don't
>     use up RAM for these).
>   * Gather posting lists & term infos in RAM, but periodically do
>     in-RAM merges.  Once RAM is full, flush buffers to disk (and
>     merge them later when it's time to make a real segment).
>   * Recycle objects/buffers to reduce time/stress in GC.
>   * Other various optimizations.
> Some of these changes are similar to how KinoSearch builds a segment.
> But, I haven't made any changes to Lucene's file format nor added
> requirements for a global fields schema.
> So far the only externally visible change is a new method
> "setRAMBufferSize" in IndexWriter (and setMaxBufferedDocs is
> deprecated) so that it flushes according to RAM usage and not a fixed
> number of documents added.
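
As a rough illustration of flushing by RAM usage rather than by document
count, here is a small sketch.  It does not use IndexWriter or the
proposed setRAMBufferSize method.  RamBoundedBuffer, its fields and the
2-bytes-per-char estimate are hypothetical and exist only to show the
trigger condition.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: flush when estimated RAM use reaches a limit. */
final class RamBoundedBuffer {
    private final long ramBufferBytes;   // flush threshold in bytes
    private final List<String> pending = new ArrayList<>();
    private long bytesUsed;
    private int flushCount;

    RamBoundedBuffer(long ramBufferBytes) {
        this.ramBufferBytes = ramBufferBytes;
    }

    void addDocument(String text) {
        pending.add(text);
        bytesUsed += 2L * text.length(); // crude estimate: 2 bytes per char
        if (bytesUsed >= ramBufferBytes) {
            flush();                     // triggered by size, not doc count
        }
    }

    /** Stands in for writing a segment; here it only resets the buffer. */
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        pending.clear();
        bytesUsed = 0;
        flushCount++;
    }

    int flushCount() {
        return flushCount;
    }

    int pendingDocs() {
        return pending.size();
    }

    long bytesUsed() {
        return bytesUsed;
    }
}
```

With a fixed document count, many tiny documents flush too early and a
few huge ones can exhaust memory before the count is reached.  A byte
threshold keeps the buffered amount roughly constant in both cases.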

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

