lucene-java-user mailing list archives

From Peter K <peat...@yahoo.de>
Subject Re: Comparing Indexing Speed of Lucene 3.5 and 4.0
Date Tue, 03 Jan 2012 23:52:55 GMT
Thanks Simon for your answer!

> as far as I can see you are comparing apples and pears.

When I exclude the waiting time I still get the slight but reproducible
difference**. The times for waitForGeneration are nearly the same
(~2sec). It also makes no difference when I commit instead of calling
waitForGeneration. Would you mind giving me some more hints/explanations?
I'll try to dig deeper :) !
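
The separation of indexing time from waiting time described above can be
sketched as follows. This is a minimal sketch, not the code from the
benchmark: PhaseTimer and the phase names are hypothetical, and the Lucene
calls are left to the caller as plain Runnables.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: accumulates wall-clock time per named benchmark phase. */
final class PhaseTimer {
    private final Map<String, Long> nanosByPhase = new LinkedHashMap<>();

    /** Runs the body and adds its elapsed time to the total for this phase. */
    void time(String phase, Runnable body) {
        long start = System.nanoTime();
        try {
            body.run();
        } finally {
            // Record the time even if the body throws.
            nanosByPhase.merge(phase, System.nanoTime() - start, Long::sum);
        }
    }

    /** Total seconds recorded for the phase, or 0.0 if it never ran. */
    double seconds(String phase) {
        return nanosByPhase.getOrDefault(phase, 0L) / 1e9;
    }
}
```

In the benchmark, the loop over innerRun would go into one phase and the
waitForGeneration (or commit) call into another, so the two times can be
reported separately.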

> Your comparison is waiting for merges to finish and if you are using multiple threads
> lucene 4.0 will flush more segments to disk than 3.5

It does not seem to be an 'IO related issue' because using RAMDirectory
results in the same times.
And indexing via Luc4 with only one thread shouldn't be slower than 3.5 (?)


> You should add some more randomness or reality to your test.

Hmmh, ok. The uid and type are the reality in my other (experimental)
project, as it uses a generated and incremented id from AtomicLong and
two types.
Or do you have an explanation why luc4 can be slower on such 'simple'
fields?
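
For comparison, a more varied data set along the lines Simon suggests could
be generated like this. It is only a sketch under assumptions: TestData, the
extra type names and the random-word text are invented for illustration and
are not part of the linked project. A fixed seed keeps the runs for both
Lucene versions comparable.

```java
import java.util.Random;

/** Hypothetical generator for less uniform benchmark documents. */
final class TestData {
    // Assumed type names; the original test uses only "test".
    static final String[] TYPES = {"test", "user", "tweet", "page"};

    private final Random rnd;

    TestData(long seed) {
        this.rnd = new Random(seed); // same seed, same sequence of values
    }

    /** Non-negative id that is not simply incremented. */
    long nextId() {
        return rnd.nextLong() & Long.MAX_VALUE;
    }

    /** One of the assumed type names, chosen uniformly. */
    String nextType() {
        return TYPES[rnd.nextInt(TYPES.length)];
    }

    /** Space-separated random lowercase words of 3 to 10 letters each. */
    String nextText(int words) {
        StringBuilder sb = new StringBuilder();
        for (int w = 0; w < words; w++) {
            if (w > 0) sb.append(' ');
            int len = 3 + rnd.nextInt(8);
            for (int c = 0; c < len; c++) {
                sb.append((char) ('a' + rnd.nextInt(26)));
            }
        }
        return sb.toString();
    }
}
```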

Could it be due to some garbage collector or thread overhead with luc4?
I see a bigger execution speed variation between single lucene 4.0 runs
(differences of seconds!) than for 3.5 (differences of 0.1 seconds!).
E.g. how could I try to reduce those/some threads?
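
One way to check the garbage collector theory is to read the JVM's own GC
counters before and after each run and to compare the run-to-run spread for
both versions. A rough sketch using the standard java.lang.management beans;
RunStats and its method names are made up for illustration.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical helper for comparing benchmark runs. */
final class RunStats {
    /** Accumulated GC time in ms reported by the JVM so far (undefined values are skipped). */
    static long gcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if the collector does not report it
            if (t > 0) total += t;
        }
        return total;
    }

    static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    /** Sample standard deviation; 0 for fewer than two runs. */
    static double stdDev(double[] xs) {
        if (xs.length < 2) return 0;
        double m = mean(xs), sq = 0;
        for (double x : xs) sq += (x - m) * (x - m);
        return Math.sqrt(sq / (xs.length - 1));
    }

    /** Difference between the slowest and the fastest run. */
    static double spread(double[] xs) {
        double min = xs[0], max = xs[0];
        for (double x : xs) {
            min = Math.min(min, x);
            max = Math.max(max, x);
        }
        return max - min;
    }
}
```

Calling gcMillis() before and after a trial and subtracting gives the GC time
spent during that trial; if the slow 4.0 runs also show the large GC deltas,
that would point to the collector rather than to the fields themselves.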

Regards,
Peter.



**
sw = new StopWatch("perf" + trial).start();
for (int i = 0; i < items; i++) {
    innerRun(trial, i);
}
float indexingTime = sw.stop().getSeconds();


// luc3.5 (Field.Store / Field.Index flags, String-based prefix coding)
@Override public void innerRun(int trial, int i) {
    long id = i;
    Document newDoc = new Document();
    NumericField idField = new NumericField("_id", 6, Field.Store.YES, true).setLongValue(id);
    Field uIdField = new Field("_uid", "" + id, Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS);
    uIdField.setIndexOptions(IndexOptions.DOCS_ONLY);
    Field typeField = new Field("_type", "test", Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS);
    typeField.setIndexOptions(IndexOptions.DOCS_ONLY);
    newDoc.add(idField);
    newDoc.add(uIdField);
    newDoc.add(typeField);
    try {
        String longStr = NumericUtils.longToPrefixCoded(id);
        latestGen = nrtManager.updateDocument(new Term("_id", longStr), newDoc);
        docs++;
    } catch (IOException ex) {
        logger.error("Cannot update " + i, ex);
    }
}


// luc4.0 (StringField.TYPE_STORED field types, BytesRef-based prefix coding)
@Override public void innerRun(int trial, int i) {
    long id = i;
    Document newDoc = new Document();
    NumericField idField = new NumericField("_id", 6, NumericField.TYPE_STORED).setLongValue(id);
    Field uIdField = new Field("_uid", "" + id, StringField.TYPE_STORED);
    Field typeField = new Field("_type", "test", StringField.TYPE_STORED);
    newDoc.add(idField);
    newDoc.add(uIdField);
    newDoc.add(typeField);
    try {
        // problem when reusing: nrt thread and this thread access the
        // same bytes at the same time!
        final BytesRef bytes = new BytesRef();
        NumericUtils.longToPrefixCoded(id, 0, bytes);
        latestGen = nrtManager.updateDocument(new Term("_id", bytes), newDoc);
        docs++;
    } catch (IOException ex) {
        logger.error("Cannot update " + i, ex);
    }
}

> hey Peter,
>
> as far as I can see you are comparing apples and pears. Your
> comparison is waiting for merges to finish and if you are using
> multiple threads lucene 4.0 will flush more segments to disk than 3.5
> so what you are seeing is likely a merge that is still trying to merge
> small segments. can you rerun and only measure the time until the last
> commit finishes (not the close)
>
> one more thing, you are indexing always the more or less same document
> and the text is very very short. You should add some more randomness
> or reality to your test.
>
> simon
>
> On Tue, Jan 3, 2012 at 5:56 PM, Peter K <peathal@yahoo.de> wrote:
>> Hi,
>>
>> I recently switched an experimental project from Lucene 3.5 to 4.0 from
>> 6th Dec 2011
>> and my indexing time increased by nearly 20% on my local machine*.
>> It seems to me that two simple StringField's could cause this slow down:
>> Field uIdField = new Field("_uid", "" + id, StringField.TYPE_STORED);
>> Field typeField = new Field("_type", "test", StringField.TYPE_STORED);
>>
>> Without them Lucene 4 is faster**. Here is a recreation using different
>> branches for every lucene version:
>> https://github.com/karussell/lucene-tmp
>> Or is there something wrong with my too simplistic scenario?
>>
>> Furthermore: How could I further improve Lucene 4.0 indexing speed?
>> (I already read through the performance list on the wiki)
>>
>> Regards,
>> Peter.
>>
>> *
>> open jdk 1.6.0_20  (but also confirmed with latest java6 from oracle)
>> ubuntu/10.10 linux/2.6.35-31 i686, 2GB ram
>>
>> **
>> lucene 3.5
>> 23.5sec index all three fields: _id, _uid, type
>> 19.0sec index only the _id field
>>
>> lucene 4
>> 29.5sec index _id, _uid, type
>> 16.5sec index only the _id
>>



