Reply-To: java-dev@lucene.apache.org
Message-ID: <8582539.1175602532530.JavaMail.jira@brutus>
Date: Tue, 3 Apr 2007 05:15:32 -0700 (PDT)
From: "Michael McCandless (JIRA)" <jira@apache.org>
To: java-dev@lucene.apache.org
Subject: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to
 buffer added documents
In-Reply-To: <16648050.1174583194037.JavaMail.jira@brutus>


    [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486334 ] 

Michael McCandless commented on LUCENE-843:
-------------------------------------------

Here are the results for "normal" sized docs (1K tokens = ~5,500 bytes plain text each):

  200000 DOCS @ ~5,500 bytes plain text
  RAM = 32 MB
  NUM THREADS = 1
  MERGE FACTOR = 10


    No term vectors or stored fields

      AUTOCOMMIT = true (commit whenever RAM is full)

        old
          200000 docs in 397.6 secs
          index size = 415M

        new
          200000 docs in 167.5 secs
          index size = 411M

        Total Docs/sec:             old   503.1; new  1194.1 [  137.3% faster]
        Docs/MB @ flush:            old    81.6; new   406.2 [  397.6% more]
        Avg RAM used (MB) @ flush:  old    87.3; new    35.2 [   59.7% less]


      AUTOCOMMIT = false (commit only once at the end)

        old
          200000 docs in 394.6 secs
          index size = 415M

        new
          200000 docs in 168.4 secs
          index size = 408M

        Total Docs/sec:             old   506.9; new  1187.7 [  134.3% faster]
        Docs/MB @ flush:            old    81.6; new   432.2 [  429.4% more]
        Avg RAM used (MB) @ flush:  old   126.6; new    36.9 [   70.8% less]


    With term vectors (positions + offsets) and 2 small stored fields

      AUTOCOMMIT = true (commit whenever RAM is full)

        old
          200000 docs in 754.2 secs
          index size = 1.7G

        new
          200000 docs in 304.9 secs
          index size = 1.7G

        Total Docs/sec:             old   265.2; new   656.0 [  147.4% faster]
        Docs/MB @ flush:            old    46.7; new   406.2 [  769.6% more]
        Avg RAM used (MB) @ flush:  old    92.9; new    35.2 [   62.1% less]


      AUTOCOMMIT = false (commit only once at the end)

        old
          200000 docs in 743.9 secs
          index size = 1.7G

        new
          200000 docs in 244.3 secs
          index size = 1.7G

        Total Docs/sec:             old   268.9; new   818.7 [  204.5% faster]
        Docs/MB @ flush:            old    46.7; new   432.2 [  825.2% more]
        Avg RAM used (MB) @ flush:  old    93.0; new    36.6 [   60.6% less]
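
The bracketed percentages in the summary lines follow from each old/new pair. Below is a minimal sketch of that arithmetic. The class and method names are hypothetical (they are not part of the benchmark code), and it assumes "faster"/"more" means (new - old) / old while "less" means (old - new) / old:

```java
import java.util.Locale;

public class PercentChange {
    /** Percent increase of newValue over oldValue (the "faster" / "more" figures). */
    public static double percentMore(double oldValue, double newValue) {
        return 100.0 * (newValue - oldValue) / oldValue;
    }

    /** Percent reduction from oldValue to newValue (the "less" figures). */
    public static double percentLess(double oldValue, double newValue) {
        return 100.0 * (oldValue - newValue) / oldValue;
    }

    public static void main(String[] args) {
        // Figures from the first configuration above (no term vectors, AUTOCOMMIT = true).
        System.out.printf(Locale.ROOT, "%.1f%% faster%n", percentMore(503.1, 1194.1));   // 137.3% faster
        System.out.printf(Locale.ROOT, "%.1f%% less RAM%n", percentLess(87.3, 35.2));    // 59.7% less RAM
    }
}
```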


> improve how IndexWriter uses RAM to buffer added documents
> ----------------------------------------------------------
>
>                 Key: LUCENE-843
>                 URL: https://issues.apache.org/jira/browse/LUCENE-843
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-843.patch, LUCENE-843.take2.patch, LUCENE-843.take3.patch, LUCENE-843.take4.patch
>
>
> I'm working on a new class (MultiDocumentWriter) that writes more than
> one document directly into a single Lucene segment, more efficiently
> than the current approach.
> This only affects the creation of an initial segment from added
> documents.  I haven't changed anything after that, eg how segments are
> merged.
> The basic ideas are:
>   * Write stored fields and term vectors directly to disk (don't
>     use up RAM for these).
>   * Gather posting lists & term infos in RAM, but periodically do
>     in-RAM merges.  Once RAM is full, flush buffers to disk (and
>     merge them later when it's time to make a real segment).
>   * Recycle objects/buffers to reduce time/stress in GC.
>   * Various other optimizations.
> Some of these changes are similar to how KinoSearch builds a segment.
> However, I haven't changed Lucene's file format or added
> requirements for a global fields schema.
> So far the only externally visible change is a new method
> "setRAMBufferSize" in IndexWriter (and setMaxBufferedDocs is
> deprecated) so that it flushes according to RAM usage and not a fixed
> number of documents added.
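
To illustrate the difference between flushing on RAM usage and flushing on a fixed document count, here is a toy sketch. It is not the MultiDocumentWriter from the patch and does not use any Lucene API: the class RamBoundedBuffer and all of its methods are hypothetical, "RAM used" is approximated by the UTF-8 byte length of each document, and the "flush" only clears the buffer where a real writer would write a segment.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Toy buffer that flushes once an estimated RAM budget is reached, not after N documents. */
public class RamBoundedBuffer {
    private final long ramBudgetBytes;
    private final List<byte[]> buffered = new ArrayList<>();
    private long bytesUsed = 0;
    private int flushCount = 0;

    public RamBoundedBuffer(long ramBudgetBytes) {
        this.ramBudgetBytes = ramBudgetBytes;
    }

    /** Buffers one document and flushes if the estimated RAM use reaches the budget. */
    public void addDocument(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        buffered.add(bytes);
        bytesUsed += bytes.length;
        if (bytesUsed >= ramBudgetBytes) {
            flush();
        }
    }

    /** Stand-in for writing a segment: drops the buffered documents and resets the counter. */
    public void flush() {
        if (buffered.isEmpty()) {
            return;
        }
        buffered.clear();   // the list object itself is reused for the next batch
        bytesUsed = 0;
        flushCount++;
    }

    public int getFlushCount() {
        return flushCount;
    }

    public int getBufferedDocCount() {
        return buffered.size();
    }

    public static void main(String[] args) {
        RamBoundedBuffer buffer = new RamBoundedBuffer(10);  // 10-byte budget, for illustration only
        buffer.addDocument("abcd");   // 4 bytes buffered
        buffer.addDocument("abcd");   // 8 bytes buffered
        buffer.addDocument("abcd");   // 12 bytes >= 10, so this triggers a flush
        System.out.println("flushes = " + buffer.getFlushCount());  // flushes = 1
    }
}
```

With a doc-count trigger, many small documents would flush long before RAM is full, and a few large ones could overshoot it; a byte-based trigger keeps each flush near the same memory footprint regardless of document size.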
