hey,
What I wonder in general is whether the benchmarks are comparable. The
benchmark code has changed a lot since 2.4, so there might be additional
fields and/or different settings for what gets indexed and how.

Could you check with Luke whether the index has the same fields and
whether the settings are the same or similar, and report back? I also
wonder whether it now uses update instead of add, i.e. buffers and
applies deletes etc.
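
To keep the numbers in one place while checking, here is a minimal sketch
that averages the rec/s figures over the three rounds of each report quoted
below and prints the ratios. The figures are copied from those reports; the
dictionary layout and the helper names (`mean`, `speed_ratio`) are made up
for illustration and are not part of the benchmark package.

```python
# rec/s per round, copied from the "Report sum by Prefix (MAddDocs)" tables
# quoted in this thread. Labels are for illustration only.
REC_PER_SEC = {
    "2.4.1": [1609.1, 1746.4, 1566.8],
    "2.9.4": [1046.49, 1165.35, 1245.86],
    "3.5.0 StandardAnalyzer": [676.48, 626.13, 687.68],
    "3.5.0 ClassicAnalyzer": [715.76, 679.04, 761.95],
}


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def speed_ratio(faster, slower):
    """Ratio of the mean rec/s of one run to the mean rec/s of another."""
    return mean(REC_PER_SEC[faster]) / mean(REC_PER_SEC[slower])


if __name__ == "__main__":
    # Mean throughput per version, averaged over the three rounds.
    for name, rounds in REC_PER_SEC.items():
        print(f"{name:24s} {mean(rounds):9.2f} rec/s")

    # Ratios relative to the 3.5.0 StandardAnalyzer run.
    for name in ("2.4.1", "2.9.4", "3.5.0 ClassicAnalyzer"):
        ratio = speed_ratio(name, "3.5.0 StandardAnalyzer")
        print(f"{name} vs 3.5.0 StandardAnalyzer: {ratio:.2f}x")
```

Averaged this way, 2.4.1 comes out at roughly 2.5x the 3.5.0
StandardAnalyzer throughput, and ClassicAnalyzer on 3.5.0 is only about 8%
faster than StandardAnalyzer, so the analyzer alone does not explain the gap.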
simon
On Mon, Dec 12, 2011 at 10:03 PM, Sean Tong <stong@jamasoftware.com> wrote:
> Thanks Simon for your response.
>
> I just re-ran the 3.5 benchmark with the ClassicAnalyzer. Here are the results:
>
> [java] ------------> Report sum by Prefix (MAddDocs) and Round (3 about 3 out of 14)
> [java] Operation round flush mrg runCnt recsPerRun rec/s elapsedSec avgUsedMem avgTotalMem
> [java] MAddDocs_200000 0 16.00 10 1 200000 715.76 279.42 48,828,144 128,057,344
> [java] MAddDocs_200000 - 1 16.00 10 - - 1 - - 200000 - - 679.04 - - 294.53 - 68,321,424 - 85,721,088
> [java] MAddDocs_200000 2 16.00 10 1 200000 761.95 262.49 63,139,256 91,881,472
>
> The performance is slightly better than with StandardAnalyzer, but it is
> still much worse than the performance with 2.4.1.
>
> Sean
>
> -----Original Message-----
> From: Simon Willnauer [mailto:simon.willnauer@googlemail.com]
> Sent: Monday, December 12, 2011 12:20 PM
> To: java-user@lucene.apache.org
> Subject: Re: Is indexing much slower in 3.5.0 than in 2.4.1 for Wikipedia data?
>
> hey,
>
> Can you try using ClassicAnalyzer instead of StandardAnalyzer in 3.5?
> In 3.5, StandardAnalyzer is a different implementation than in 2.9 and
> 2.4. Alternatively, rerun the 2.4 benchmarks with a WhitespaceAnalyzer,
> just for the comparison.
>
> simon
>
> On Mon, Dec 12, 2011 at 7:08 PM, Sean Tong <stong@jamasoftware.com> wrote:
>> Looks like the attachment with the algorithm is missing from my last
>> email, so I have pasted the text here. Thanks in advance for any help.
>>
>> #Start of the wikipedia-default.alg file
>>
>> merge.factor=mrg:10:10:10
>> max.field.length=2147483647
>> #max.buffered=buf:10:10:100:100
>> ram.flush.mb=flush:16:16:16
>>
>> compound=true
>>
>> analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
>> directory=FSDirectory
>>
>> doc.stored=true
>> doc.tokenized=true
>> doc.term.vector=false
>> log.step=5000
>>
>> docs.file=temp/enwiki-20070527-pages-articles.xml
>>
>> content.source=org.apache.lucene.benchmark.byTask.feeds.EnwikiContentSource
>>
>> query.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersQueryMaker
>>
>> # task at this depth or less would print when they start
>> task.max.depth.log=2
>>
>> log.queries=false
>> # -------------------------------------------------------------------------------------
>>
>> { "Rounds"
>>
>> ResetSystemErase
>>
>> { "Populate"
>> CreateIndex
>> { "MAddDocs" AddDoc > : 200000
>> CloseIndex
>> }
>>
>> NewRound
>>
>> } : 3
>>
>> RepSumByName
>> RepSumByPrefRound MAddDocs
>>
>> #End of wikipedia-default.alg file
>>
>> Thanks,
>>
>> Sean
>>
>>
>> From: Sean Tong [mailto:stong@jamasoftware.com]
>> Sent: Sunday, December 11, 2011 11:54 PM
>> To: java-user@lucene.apache.org
>> Subject: Is indexing much slower in 3.5.0 than in 2.4.1 for Wikipedia data?
>>
>> Hi,
>>
>> We plan to upgrade the Lucene library in our application from 2.4.1 to
>> 3.5.0. I have been running the benchmark tests that come with Lucene.
>> To my surprise, I found that indexing in 3.5.0 is significantly slower
>> than in 2.4.1 for the Wikipedia data.
>>
>> Attached is the algorithm for the tests. The tests used the default
>> Lucene settings for flush memory size and merge factor. 512M of memory
>> was used for the tasks. The test machine is a 64-bit Windows 7 machine
>> with an Intel Core i7.
>>
>> The command:
>> %ant -Dtask.alg=conf/wikipedia-default.alg -Dtask.mem=512M run-task
>>
>> Here are the test results:
>>
>> Lucene 2.4.1
>>
>> [java] ------------> Report sum by Prefix (MAddDocs) and Round (3 about 3 out of 14)
>> [java] Operation round flush mrg runCnt recsPerRun rec/s elapsedSec avgUsedMem avgTotalMem
>> [java] MAddDocs_200000 0 16.00 10 1 200000 1,609.1 124.29 89,218,496 241,631,232
>> [java] MAddDocs_200000 - 1 16.00 10 - - 1 - - 200000 - - 1,746.4 - - 114.52 - 102,365,864 - 241,762,304
>> [java] MAddDocs_200000 2 16.00 10 1 200000 1,566.8 127.65 69,428,144 174,194,688
>>
>>
>> Lucene 2.9.4
>>
>> [java] ------------> Report sum by Prefix (MAddDocs) and Round (3 about 3 out of 14)
>> [java] Operation round flush mrg runCnt recsPerRun rec/s elapsedSec avgUsedMem avgTotalMem
>> [java] MAddDocs_200000 0 16.00 10 1 200000 1,046.49 191.12 82,676,152 139,657,216
>> [java] MAddDocs_200000 - 1 16.00 10 - - 1 - - 200000 - 1,165.35 - - 171.62 - 119,364,128 - 156,762,112
>> [java] MAddDocs_200000 2 16.00 10 1 200000 1,245.86 160.53 50,361,760 137,625,600
>>
>> Lucene 3.5.0
>>
>> [java] ------------> Report sum by Prefix (MAddDocs) and Round (3 about 3 out of 14)
>> [java] Operation round flush mrg runCnt recsPerRun rec/s elapsedSec avgUsedMem avgTotalMem
>> [java] MAddDocs_200000 0 16.00 10 1 200000 676.48 295.65 70,917,592 129,695,744
>> [java] MAddDocs_200000 - 1 16.00 10 - - 1 - - 200000 - - 626.13 - - 319.42 - 50,329,552 - 94,240,768
>> [java] MAddDocs_200000 2 16.00 10 1 200000 687.68 290.83 57,732,640 92,864,512
>>
>>
>> Indexing with 2.4.1 runs at 2.3x the speed of 3.5.0. Did I miss any
>> settings or configurations?
>>
>> Thanks,
>>
>> Sean
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>