lucene-java-user mailing list archives

From Sean Tong <st...@jamasoftware.com>
Subject RE: Is indexing much slower in 3.5.0 than in 2.4.1 for Wikipedia data?
Date Mon, 12 Dec 2011 21:03:31 GMT
Thanks Simon for your response.

I just re-ran the 3.5 benchmark with the ClassicAnalyzer. Here are the results:

     [java] ------------> Report sum by Prefix (MAddDocs) and Round (3 about 3 out of 14)
     [java] Operation       round flush mrg   runCnt   recsPerRun        rec/s  elapsedSec    avgUsedMem    avgTotalMem
     [java] MAddDocs_200000     0 16.00  10        1       200000       715.76      279.42    48,828,144    128,057,344
     [java] MAddDocs_200000 -   1 16.00  10 -  -   1 -  -  200000 -  -  679.04 -  - 294.53 -  68,321,424 -   85,721,088
     [java] MAddDocs_200000     2 16.00  10        1       200000       761.95      262.49    63,139,256     91,881,472

Performance is slightly better than with StandardAnalyzer, but it is still
much worse than with 2.4.1.
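
To put numbers on "slightly better" and "much worse", here is a minimal sketch that averages the rec/s column over the three rounds of each report in this thread and compares the means. The class and method names are only illustrative; the figures are copied from the reports above and in the quoted message below.

```java
import java.util.Locale;

public class ThroughputComparison {
    // rec/s per round (0, 1, 2), copied from the MAddDocs reports in this thread
    static final double[] LUCENE_241_STANDARD = {1609.1, 1746.4, 1566.8};
    static final double[] LUCENE_350_STANDARD = {676.48, 626.13, 687.68};
    static final double[] LUCENE_350_CLASSIC = {715.76, 679.04, 761.95};

    // Arithmetic mean of the per-round throughput values
    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    // How many times faster run a is than run b, by mean rec/s
    static double ratio(double[] a, double[] b) {
        return mean(a) / mean(b);
    }

    public static void main(String[] args) {
        System.out.printf(Locale.ROOT, "3.5.0 Classic vs 3.5.0 Standard: %.2fx%n",
                ratio(LUCENE_350_CLASSIC, LUCENE_350_STANDARD));
        System.out.printf(Locale.ROOT, "2.4.1 Standard vs 3.5.0 Classic: %.2fx%n",
                ratio(LUCENE_241_STANDARD, LUCENE_350_CLASSIC));
    }
}
```

By mean rec/s, ClassicAnalyzer on 3.5.0 comes out roughly 8% faster than StandardAnalyzer on 3.5.0, while 2.4.1 is still more than twice as fast as either. With only three rounds per configuration, these ratios are rough indications, not precise measurements.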

Sean

-----Original Message-----
From: Simon Willnauer [mailto:simon.willnauer@googlemail.com] 
Sent: Monday, December 12, 2011 12:20 PM
To: java-user@lucene.apache.org
Subject: Re: Is indexing much slower in 3.5.0 than in 2.4.1 for Wikipedia data?

hey,

can you try the ClassicAnalyzer instead of the StandardAnalyzer in 3.5?
In 3.5 the StandardAnalyzer is a different implementation than in 2.9 and 2.4.
Alternatively, rerun the 2.4 benchmarks with a WhitespaceAnalyzer, just for comparison.
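
For reference, either suggestion is a one-line change to the analyzer property in the .alg file. This is a sketch, assuming the benchmark module instantiates whatever analyzer class is named there, as it does for the StandardAnalyzer entry in the file quoted below. Only one analyzer line should be active at a time.

```
# 3.5.0 run: ClassicAnalyzer is the analyzer that was named StandardAnalyzer before 3.1
analyzer=org.apache.lucene.analysis.standard.ClassicAnalyzer

# 2.4.1 and 3.5.0 runs: whitespace-only tokenization, to take analyzer differences out of the comparison
#analyzer=org.apache.lucene.analysis.WhitespaceAnalyzer
```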

simon

On Mon, Dec 12, 2011 at 7:08 PM, Sean Tong <stong@jamasoftware.com> wrote:
> Looks like the attachment for the algorithm is missing from my last email. I have pasted
> the text here. Thanks in advance for any help.
>
> #Start of the wikipedia-default.alg file
>
> merge.factor=mrg:10:10:10
> max.field.length=2147483647
> #max.buffered=buf:10:10:100:100
> ram.flush.mb=flush:16:16:16
>
> compound=true
>
> analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
> directory=FSDirectory
>
> doc.stored=true
> doc.tokenized=true
> doc.term.vector=false
> log.step=5000
>
> docs.file=temp/enwiki-20070527-pages-articles.xml
>
> content.source=org.apache.lucene.benchmark.byTask.feeds.EnwikiContentSource
>
> query.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersQueryMaker
>
> # task at this depth or less would print when they start
> task.max.depth.log=2
>
> log.queries=false
> # -------------------------------------------------------------------------------------
>
> { "Rounds"
>
>    ResetSystemErase
>
>    { "Populate"
>        CreateIndex
>        { "MAddDocs" AddDoc > : 200000
>        CloseIndex
>    }
>
>    NewRound
>
> } : 3
>
> RepSumByName
> RepSumByPrefRound MAddDocs
>
> #End of wikipedia-default.alg file
>
> Thanks,
>
> Sean
>
>
> From: Sean Tong [mailto:stong@jamasoftware.com]
> Sent: Sunday, December 11, 2011 11:54 PM
> To: java-user@lucene.apache.org
> Subject: Is indexing much slower in 3.5.0 than in 2.4.1 for Wikipedia data?
>
> Hi,
>
> We plan to upgrade the Lucene library in our application from 2.4.1 to 3.5.0. I have
> been running the benchmark tests that come with Lucene. To my surprise, I found that indexing
> in 3.5.0 is significantly slower than in 2.4.1 for the Wikipedia data.
>
> Attached is the algorithm for the tests. The tests used the default Lucene settings for
> flush memory size and merge factor. 512M of memory was used for the tasks. The test machine
> is a 64-bit Windows 7 machine with an Intel Core i7.
>
> The command:
> %ant -Dtask.alg=conf/wikipedia-default.alg -Dtask.mem=512M run-task
>
> Here are the test results:
>
> Lucene 2.4.1
>
>     [java] ------------> Report sum by Prefix (MAddDocs) and Round (3 about 3 out of 14)
>     [java] Operation       round flush mrg   runCnt   recsPerRun        rec/s  elapsedSec    avgUsedMem    avgTotalMem
>     [java] MAddDocs_200000     0 16.00  10        1       200000      1,609.1      124.29    89,218,496    241,631,232
>     [java] MAddDocs_200000 -   1 16.00  10 -  -   1 -  -  200000 -  - 1,746.4 -  - 114.52 - 102,365,864 -  241,762,304
>     [java] MAddDocs_200000     2 16.00  10        1       200000      1,566.8      127.65    69,428,144    174,194,688
>
>
> Lucene 2.9.4
>
>     [java] ------------> Report sum by Prefix (MAddDocs) and Round (3 about 3 out of 14)
>     [java] Operation       round flush mrg   runCnt   recsPerRun        rec/s  elapsedSec    avgUsedMem    avgTotalMem
>     [java] MAddDocs_200000     0 16.00  10        1       200000     1,046.49      191.12    82,676,152    139,657,216
>     [java] MAddDocs_200000 -   1 16.00  10 -  -   1 -  -  200000 -   1,165.35 -  - 171.62 - 119,364,128 -  156,762,112
>     [java] MAddDocs_200000     2 16.00  10        1       200000     1,245.86      160.53    50,361,760    137,625,600
>
> Lucene 3.5.0
>
>     [java] ------------> Report sum by Prefix (MAddDocs) and Round (3 about 3 out of 14)
>     [java] Operation       round flush mrg   runCnt   recsPerRun        rec/s  elapsedSec    avgUsedMem    avgTotalMem
>     [java] MAddDocs_200000     0 16.00  10        1       200000       676.48      295.65    70,917,592    129,695,744
>     [java] MAddDocs_200000 -   1 16.00  10 -  -   1 -  -  200000 -  -  626.13 -  - 319.42 -  50,329,552 -   94,240,768
>     [java] MAddDocs_200000     2 16.00  10        1       200000       687.68      290.83    57,732,640     92,864,512
>
>
> The indexing speed with 2.4.1 is 2.3x the speed with 3.5.0. Did I miss any
> settings or configurations?
>
> Thanks,
>
> Sean
>
>

