lucene-java-user mailing list archives

From Sean Tong <st...@jamasoftware.com>
Subject RE: Is indexing much slower in 3.5.0 than in 2.4.1 for Wikipedia data?
Date Mon, 12 Dec 2011 18:08:57 GMT
It looks like the attachment with the algorithm file was missing from my last email, so I
have pasted its text here. Thanks in advance for any help.

#Start of the wikipedia-default.alg file

merge.factor=mrg:10:10:10
max.field.length=2147483647
#max.buffered=buf:10:10:100:100
ram.flush.mb=flush:16:16:16

compound=true

analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
directory=FSDirectory

doc.stored=true
doc.tokenized=true
doc.term.vector=false
log.step=5000

docs.file=temp/enwiki-20070527-pages-articles.xml

content.source=org.apache.lucene.benchmark.byTask.feeds.EnwikiContentSource

query.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersQueryMaker

# task at this depth or less would print when they start
task.max.depth.log=2

log.queries=false
# -------------------------------------------------------------------------------------

{ "Rounds"

    ResetSystemErase

    { "Populate"
        CreateIndex
        { "MAddDocs" AddDoc > : 200000
        CloseIndex
    }

    NewRound

} : 3

RepSumByName
RepSumByPrefRound MAddDocs

#End of wikipedia-default.alg file

Thanks,

Sean


From: Sean Tong [mailto:stong@jamasoftware.com]
Sent: Sunday, December 11, 2011 11:54 PM
To: java-user@lucene.apache.org
Subject: Is indexing much slower in 3.5.0 than in 2.4.1 for Wikipedia data?

Hi,

We plan to upgrade the Lucene library in our application from 2.4.1 to 3.5.0. I have been
running the benchmark tests that come with Lucene. To my surprise, I found that indexing
in 3.5.0 is significantly slower than in 2.4.1 for the Wikipedia data.

Attached is the algorithm for the tests. The tests used the default Lucene settings for flush
memory size and merge factor. 512M of memory was used for the tasks. The test machine is a
64-bit Windows 7 machine with an Intel Core i7.

The command:
%ant -Dtask.alg=conf/wikipedia-default.alg -Dtask.mem=512M run-task

Here are the test results:

Lucene 2.4.1

     [java] ------------> Report sum by Prefix (MAddDocs) and Round (3 about 3 out of 14)
     [java] Operation       round flush mrg   runCnt   recsPerRun        rec/s  elapsedSec    avgUsedMem    avgTotalMem
     [java] MAddDocs_200000     0 16.00  10        1       200000      1,609.1      124.29    89,218,496    241,631,232
     [java] MAddDocs_200000 -   1 16.00  10 -  -   1 -  -  200000 -  - 1,746.4 -  - 114.52 - 102,365,864 -  241,762,304
     [java] MAddDocs_200000     2 16.00  10        1       200000      1,566.8      127.65    69,428,144    174,194,688


Lucene 2.9.4

     [java] ------------> Report sum by Prefix (MAddDocs) and Round (3 about 3 out of 14)
     [java] Operation       round flush mrg   runCnt   recsPerRun        rec/s  elapsedSec    avgUsedMem    avgTotalMem
     [java] MAddDocs_200000     0 16.00  10        1       200000     1,046.49      191.12    82,676,152    139,657,216
     [java] MAddDocs_200000 -   1 16.00  10 -  -   1 -  -  200000 -   1,165.35 -  - 171.62 - 119,364,128 -  156,762,112
     [java] MAddDocs_200000     2 16.00  10        1       200000     1,245.86      160.53    50,361,760    137,625,600

Lucene 3.5.0

     [java] ------------> Report sum by Prefix (MAddDocs) and Round (3 about 3 out of 14)
     [java] Operation       round flush mrg   runCnt   recsPerRun        rec/s  elapsedSec    avgUsedMem    avgTotalMem
     [java] MAddDocs_200000     0 16.00  10        1       200000       676.48      295.65    70,917,592    129,695,744
     [java] MAddDocs_200000 -   1 16.00  10 -  -   1 -  -  200000 -  -  626.13 -  - 319.42 -  50,329,552 -   94,240,768
     [java] MAddDocs_200000     2 16.00  10        1       200000       687.68      290.83    57,732,640     92,864,512


Indexing with 2.4.1 is roughly 2.3x as fast as with 3.5.0. Did I miss any settings
or configurations?
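
For anyone who wants to check the ratio, here is a minimal Python sketch that recomputes it from the rec/s column of the three reports. The figures are copied from the MAddDocs rows; the names `rec_per_sec`, `mean` and `speedup` are made up for this illustration and are not part of the Lucene benchmark package.

```python
# rec/s values copied from the MAddDocs rows of the reports (rounds 0, 1, 2).
rec_per_sec = {
    "2.4.1": [1609.1, 1746.4, 1566.8],
    "2.9.4": [1046.49, 1165.35, 1245.86],
    "3.5.0": [676.48, 626.13, 687.68],
}


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def speedup(baseline, other):
    """How many times faster `baseline` indexed than `other`, on mean rec/s."""
    return mean(rec_per_sec[baseline]) / mean(rec_per_sec[other])


def per_round_speedup(baseline, other):
    """Ratio of rec/s for each round, pairing round N with round N."""
    return [b / o for b, o in zip(rec_per_sec[baseline], rec_per_sec[other])]


for version, rates in rec_per_sec.items():
    print(f"{version}: mean {mean(rates):.1f} rec/s")

print(f"2.4.1 vs 3.5.0 (mean): {speedup('2.4.1', '3.5.0'):.2f}x")
print("2.4.1 vs 3.5.0 (per round):",
      [f"{r:.2f}x" for r in per_round_speedup("2.4.1", "3.5.0")])
```

On the mean over the three rounds the ratio comes out a little under 2.5x, and the per-round ratios range from about 2.3x to 2.8x, so the 2.3x figure is at the low end of what the reports show.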

Thanks,

Sean


