It looks like the attachment with the algorithm was missing from my last email, so I have
pasted the text here. Thanks in advance for any help.
#Start of the wikipedia-default.alg file
merge.factor=mrg:10:10:10
max.field.length=2147483647
#max.buffered=buf:10:10:100:100
ram.flush.mb=flush:16:16:16
compound=true
analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
directory=FSDirectory
doc.stored=true
doc.tokenized=true
doc.term.vector=false
log.step=5000
docs.file=temp/enwiki-20070527-pages-articles.xml
content.source=org.apache.lucene.benchmark.byTask.feeds.EnwikiContentSource
query.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersQueryMaker
# task at this depth or less would print when they start
task.max.depth.log=2
log.queries=false
# -------------------------------------------------------------------------------------
{ "Rounds"
    ResetSystemErase
    { "Populate"
        CreateIndex
        { "MAddDocs" AddDoc > : 200000
        CloseIndex
    }
    NewRound
} : 3
RepSumByName
RepSumByPrefRound MAddDocs
#End of wikipedia-default.alg file
Thanks,
Sean
From: Sean Tong [mailto:stong@jamasoftware.com]
Sent: Sunday, December 11, 2011 11:54 PM
To: java-user@lucene.apache.org
Subject: Is indexing much slower in 3.5.0 than in 2.4.1 for Wikipedia data?
Hi,
We plan to upgrade the Lucene library in our application from 2.4.1 to 3.5.0. I have been
running the benchmark tests that come with Lucene. To my surprise, I found that indexing
in 3.5.0 is significantly slower than in 2.4.1 for the Wikipedia data.
Attached is the algorithm for the tests. The tests used the default Lucene settings for flush
memory size and merge factor. 512M of memory was used for the tasks. The test machine is a
64-bit Windows 7 machine with an Intel Core i7.
The command:
%ant -Dtask.alg=conf/wikipedia-default.alg -Dtask.mem=512M run-task
Here are the test results:
Lucene 2.4.1
[java] ------------> Report sum by Prefix (MAddDocs) and Round (3 about 3 out of 14)
[java] Operation        round  flush  mrg  runCnt  recsPerRun     rec/s  elapsedSec   avgUsedMem  avgTotalMem
[java] MAddDocs_200000      0  16.00   10       1      200000   1,609.1      124.29   89,218,496  241,631,232
[java] MAddDocs_200000      1  16.00   10       1      200000   1,746.4      114.52  102,365,864  241,762,304
[java] MAddDocs_200000      2  16.00   10       1      200000   1,566.8      127.65   69,428,144  174,194,688
Lucene 2.9.4
[java] ------------> Report sum by Prefix (MAddDocs) and Round (3 about 3 out of 14)
[java] Operation        round  flush  mrg  runCnt  recsPerRun     rec/s  elapsedSec   avgUsedMem  avgTotalMem
[java] MAddDocs_200000      0  16.00   10       1      200000  1,046.49      191.12   82,676,152  139,657,216
[java] MAddDocs_200000      1  16.00   10       1      200000  1,165.35      171.62  119,364,128  156,762,112
[java] MAddDocs_200000      2  16.00   10       1      200000  1,245.86      160.53   50,361,760  137,625,600
Lucene 3.5.0
[java] ------------> Report sum by Prefix (MAddDocs) and Round (3 about 3 out of 14)
[java] Operation        round  flush  mrg  runCnt  recsPerRun     rec/s  elapsedSec   avgUsedMem  avgTotalMem
[java] MAddDocs_200000      0  16.00   10       1      200000    676.48      295.65   70,917,592  129,695,744
[java] MAddDocs_200000      1  16.00   10       1      200000    626.13      319.42   50,329,552   94,240,768
[java] MAddDocs_200000      2  16.00   10       1      200000    687.68      290.83   57,732,640   92,864,512
Indexing with 2.4.1 is roughly 2.3x to 2.8x as fast as with 3.5.0, depending on the
round. Did I miss any settings or configurations?
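In case anyone wants to recheck the ratio, here is a minimal Java sketch that recomputes it from the rec/s column of the three reports above. The class and method names (SpeedupCheck, mean, speedup) are only my own illustration and are not part of the Lucene benchmark package; the numbers are copied by hand from the MAddDocs rows.

```java
import java.util.Arrays;

public class SpeedupCheck {
    // rec/s for rounds 0, 1 and 2, copied from the MAddDocs rows in the reports
    static final double[] LUCENE_2_4_1 = {1609.1, 1746.4, 1566.8};
    static final double[] LUCENE_2_9_4 = {1046.49, 1165.35, 1245.86};
    static final double[] LUCENE_3_5_0 = {676.48, 626.13, 687.68};

    // Arithmetic mean of the per-round throughput values
    static double mean(double[] values) {
        return Arrays.stream(values).average().orElse(Double.NaN);
    }

    // Ratio of mean throughput: how many times faster 'faster' is than 'slower'
    static double speedup(double[] faster, double[] slower) {
        return mean(faster) / mean(slower);
    }

    public static void main(String[] args) {
        System.out.printf("2.4.1 vs 3.5.0 (mean): %.2fx%n",
                speedup(LUCENE_2_4_1, LUCENE_3_5_0));
        System.out.printf("2.9.4 vs 3.5.0 (mean): %.2fx%n",
                speedup(LUCENE_2_9_4, LUCENE_3_5_0));
        // Per-round ratios, since the three rounds vary noticeably
        for (int round = 0; round < LUCENE_2_4_1.length; round++) {
            System.out.printf("round %d, 2.4.1 vs 3.5.0: %.2fx%n",
                    round, LUCENE_2_4_1[round] / LUCENE_3_5_0[round]);
        }
    }
}
```

Averaged over the three rounds, 2.4.1 comes out a little under 2.5x as fast as 3.5.0, with the per-round ratios ranging from about 2.3x to 2.8x.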
Thanks,
Sean