lucene-java-user mailing list archives

From "Zhang, Lisheng" <Lisheng.Zh...@BroadVision.com>
Subject RE: lucene 4.3 seems to be much slower in indexing than lucene 3.6?
Date Thu, 01 Aug 2013 18:15:38 GMT
Hi Mike,

First, I really appreciate your help (for a non-commercial product)!

1/ I attached the source code of my test (you can see I used StandardAnalyzer). Also, from the
   CheckIndex reports below, the unique term counts are identical (token counts differ slightly).
   The only stored field is an ID (1 - 8, one per document). The indexed content comes from 8
   typical files in 8 different languages (the English one is "Animal Farm" by George Orwell).
   I don't mind sending the text files if you are interested.

   The query I issued is a trivial one (no filter at all; for example, querying "boxer" to
   get "Animal Farm").

2/ CheckIndex output:

/// 361:
root@ec2usevmsstgamq:/home/cvsupport/lzhang/code/solr36# java org.apache.lucene.index.CheckIndex /home/cvsupport/lzhang/index36

NOTE: testing will be more thorough if you run java with '-ea:org.apache.lucene...', so assertions are enabled

Opening index @ /home/cvsupport/lzhang/index36

Segments file=segments_1 numSegments=1 version=3.6.1 format=FORMAT_3_1 [Lucene 3.1+]
  1 of 1: name=_0 docCount=8
    compound=false
    hasProx=true
    numFiles=11
    size (MB)=1.156
    diagnostics = {os.version=3.2.0-49-virtual, os=Linux, lucene.version=3.6.1 1362471 - thetaphi - 2012-07-17 12:40:12, source=flush, os.arch=amd64, java.version=1.7.0_25, java.vendor=Oracle Corporation}
    no deletions
    test: open reader.........OK
    test: fields..............OK [2 fields]
    test: field norms.........OK [1 fields]
    test: terms, freq, prox...OK [38611 terms; 42557 terms/docs pairs; 335024 tokens]
    test: stored fields.......OK [8 total field count; avg 1 fields per doc]
    test: term vectors........OK [8 total vector count; avg 1 term/freq vector fields per doc]

No problems were detected with this index.

/// 430
root@ec2usevmsstgamq:/home/cvsupport/lzhang/code/solr43# java org.apache.lucene.index.CheckIndex /home/cvsupport/lzhang/index43

NOTE: testing will be more thorough if you run java with '-ea:org.apache.lucene...', so assertions are enabled

Opening index @ /home/cvsupport/lzhang/index43

Segments file=segments_1 numSegments=1 version=4.3 format=
  1 of 1: name=_0 docCount=8
    codec=Lucene42
    compound=false
    numFiles=13
    size (MB)=1.742
    diagnostics = {timestamp=1375311061843, os=Linux, os.version=3.2.0-49-virtual, source=flush, lucene.version=4.3.0 1477023 - simonw - 2013-04-29 14:55:14, os.arch=amd64, java.version=1.7.0_25, java.vendor=Oracle Corporation}
    no deletions
    test: open reader.........OK
    test: fields..............OK [2 fields]
    test: field norms.........OK [1 fields]
    test: terms, freq, prox...OK [38611 terms; 42557 terms/docs pairs; 335016 tokens]
    test: stored fields.......OK [8 total field count; avg 1 fields per doc]
    test: term vectors........OK [8 total vector count; avg 1 term/freq vector fields per doc]
    test: docvalues...........OK [0 total doc count; 0 docvalues fields]

No problems were detected with this index.
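
To put numbers on the two reports above, here is a minimal sketch (not part of the attached test code) that pulls the segment size and the token count out of CheckIndex text with regular expressions. The class name CheckIndexCompare and its methods are hypothetical, and the patterns assume the report lines look exactly like the ones pasted above; it uses only the JDK, not the Lucene API.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts two figures from CheckIndex report text.
public class CheckIndexCompare {
    // Matches the token count at the end of the "terms, freq, prox" line.
    private static final Pattern TOKENS = Pattern.compile("(\\d+) tokens\\]");
    // Matches the per-segment "size (MB)=..." line.
    private static final Pattern SIZE_MB = Pattern.compile("size \\(MB\\)=([0-9.]+)");

    public static long tokenCount(String report) {
        Matcher m = TOKENS.matcher(report);
        if (!m.find()) {
            throw new IllegalArgumentException("no token count found in report");
        }
        return Long.parseLong(m.group(1));
    }

    public static double sizeMb(String report) {
        Matcher m = SIZE_MB.matcher(report);
        if (!m.find()) {
            throw new IllegalArgumentException("no segment size found in report");
        }
        return Double.parseDouble(m.group(1));
    }

    public static void main(String[] args) {
        // Relevant lines copied from the 3.6.1 and 4.3 reports above.
        String r36 = "size (MB)=1.156\n"
            + "test: terms, freq, prox...OK [38611 terms; 42557 terms/docs pairs; 335024 tokens]";
        String r43 = "size (MB)=1.742\n"
            + "test: terms, freq, prox...OK [38611 terms; 42557 terms/docs pairs; 335016 tokens]";

        System.out.printf(Locale.ROOT, "token difference (3.6 - 4.3): %d%n",
            tokenCount(r36) - tokenCount(r43));
        System.out.printf(Locale.ROOT, "size ratio (4.3 / 3.6): %.2f%n",
            sizeMb(r43) / sizeMb(r36));
    }
}
```

For the two reports above, the token counts differ by 8 and the 4.3 segment is roughly 1.5 times the size of the 3.6 one, while the term and terms/docs pair counts are identical.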



-----Original Message-----
From: Michael McCandless [mailto:lucene@mikemccandless.com]
Sent: Thursday, August 01, 2013 10:45 AM
To: Lucene Users
Subject: Re: lucene 4.3 seems to be much slower in indexing than lucene 3.6?


On Wed, Jul 31, 2013 at 7:17 PM, Zhang, Lisheng
<Lisheng.Zhang@broadvision.com> wrote:
>
> Hi Mike,
>
> I retested and results are the same:
>
> 1/ I did not use sort (so FieldCache should not enter the picture?)

No grouping or joining either (they will use FieldCache, if it's not
against a doc values field).

What sort of queries are you running?

> 2/ I created the indexed data from scratch separately for 361 and 43,
>    based on the same text (text files), and I ran the test from the
>    command line separately against each index folder, so it seems a
>    pretty fair test.

OK.

> 3/ In each test I created the searcher from scratch (to measure
>    creation time). I did not include JVM start time in either case.
>    The tests were run on the same box.

OK.

> From the indexed data it seems that 43 generated a lot more data in
> the folder; below I listed the (ls -ltr) result

This is very odd: the 4.3 index is quite a bit larger than the 3.x
index.  Are you certain the two indexed the same content in the same
way?  Which analyzer are you using?  Maybe run CheckIndex against each
index and post the output?

> (I always pass in the LUCENE_43
> version, so the lucene 42 codec should be used; why lucene41?).

This is fine: the Lucene42 codec uses Lucene41PostingsFormat.

Mike McCandless

http://blog.mikemccandless.com
