lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] Created: (LUCENE-1283) Factor out ByteSliceWriter from DocumentsWriterFieldData
Date Sat, 10 May 2008 10:35:55 GMT
Factor out ByteSliceWriter from DocumentsWriterFieldData
--------------------------------------------------------

                 Key: LUCENE-1283
                 URL: https://issues.apache.org/jira/browse/LUCENE-1283
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Index
    Affects Versions: 2.3.1, 2.3
            Reporter: Michael McCandless
            Assignee: Michael McCandless
            Priority: Minor
             Fix For: 2.4
         Attachments: LUCENE-1283.patch

DocumentsWriter uses byte slices into shared byte[]'s to hold the
growing postings data for many different terms in memory.  This is
probably the trickiest (most confusing) part of DocumentsWriter.

Right now it's not cleanly factored out and not easy to separately
test.  In working on this issue:

  http://mail-archives.apache.org/mod_mbox/lucene-java-user/200805.mbox/%3c126142c0805061426n1168421ya5594ef854fae5e4@mail.gmail.com%3e

which eventually turned out to be a bug in Oracle JRE's JIT compiler,
I factored out ByteSliceWriter and created a unit test to stress test
the writing & reading of byte slices.  The test just randomly writes N
streams interleaved into shared byte[]'s, then reads them back
verifying the results are correct.

I created the stress test to try to find any bugs in that code.  The
test ran fine (no bugs were found) but I think the refactoring is
still very much worthwhile.

I expected the changes to reduce indexing throughput, so I ran a test
indexing first 200K Wikipedia docs using this alg:

{code}
analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker

docs.file=/Volumes/External/lucene/wiki.txt
doc.stored = true
doc.term.vector = true
doc.add.log.step=2000

directory=FSDirectory
autocommit=false
compound=true

ram.flush.mb=256

{ "Rounds"
  ResetSystemErase
  { "BuildIndex"
    - CreateIndex
     { "AddDocs" AddDoc > : 200000
    - CloseIndex
  }
  NewRound
} : 4

RepSumByPrefRound BuildIndex

{code}

Ok trunk it produces these results:
{code}
Operation   round   runCnt   recsPerRun        rec/s  elapsedSec    avgUsedMem    avgTotalMem
BuildIndex      0        1       200000        791.7      252.63   338,552,096  1,061,814,272
BuildIndex -  - 1 -  -   1 -  -  200000 -  -   793.1 -  - 252.18 - 605,262,080  1,061,814,272
BuildIndex      2        1       200000        794.8      251.63   601,966,528  1,061,814,272
BuildIndex -  - 3 -  -   1 -  -  200000 -  -   782.5 -  - 255.58 - 608,699,712  1,061,814,272
{code}

and with the patch:

{code}
Operation   round   runCnt   recsPerRun        rec/s  elapsedSec    avgUsedMem    avgTotalMem
BuildIndex      0        1       200000        745.0      268.47   338,318,784  1,061,814,272
BuildIndex -  - 1 -  -   1 -  -  200000 -  -   792.7 -  - 252.30 - 605,331,776  1,061,814,272
BuildIndex      2        1       200000        786.7      254.24   602,915,712  1,061,814,272
BuildIndex -  - 3 -  -   1 -  -  200000 -  -   795.3 -  - 251.48 - 602,378,624  1,061,814,272
{code}

So it looks like the performance cost of this change is negligible (in
the noise).



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message