lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Hamish Carpenter <ham...@catalyst.net.nz>
Subject Benchmarking Results
Date Fri, 29 Nov 2002 01:41:30 GMT
Hi Everyone,

I've been lurking on this list for a couple of weeks now.  I thought I 
would contribute my experiences (with timings) of using lucene.

The main issue we have is why does performance decrease significantly 
when searching with multiple threads?

I hope this helps people starting out with lucene to compare with their 
performance.

Hamish

BTW: Optimizing took 3.5 minutes at 500,000 documents and 4.7 minutes at 
1,000,000 documents.  Sorry I don't have memory usage figures.

<benchmark>
Hardware environment
--------------------
Dedicated machine for indexing (yes/no): yes
CPU (Type, Speed and Quantity): Intel x86 P4 1.5Ghz
RAM: 512 DDR
Drive configuration (IDE, SCSI, RAID-1, RAID-5): IDE 7200rpm Raid-1

Software environment
--------------------
Java Version: 1.3.1 IBM JITC Enabled
OS Version: Debian Linux 2.4.18-686
Location of index directory (local/network): local

Lucene indexing variables
-------------------------
Number of source documents: Random generator. Set to make 1M documents
in 2x500,000 batches.
Total filesize of source documents: > 1Gb if stored.
Average filesize of source documents (in KB/MB): 1kb
Source documents storage location (filesystem, DB, http,etc): fs
File type of source documents: generated.
Parser(s) used, if any: default
Analyzer(s) used: default
Number of fields per document: 11
Type of fields: 1 date, 1 id, 9 text
Index persistence (FSDirectory, SqlDirectory, etc): FSDirectory

Time taken (in ms/s as an average of at least 3 indexing runs):
Time taken / 1000 docs indexed: 49seconds
Memory consumption: unsure

Notes (any special tuning/strategies):
--------------------------------------
A windows client ran a random document generator which created
documents based on some arrays of values and an excerpt (approx 1kb)
from a text file of the bible (King James version).
These were submitted via a socket connection (open throughout
indexing process).
The index writer was not closed between index calls.
This created a 400Mb index in 23 files (after optimization).

Query details:
--------------
Set up a threaded class to start x number of simultaneous threads to
search the above created index.

Query:  +Domain:sos +(+((Name:goo*^2.0 Name:plan*^2.0) (Teaser:goo* Tea
ser:plan*) (Details:goo* Details:plan*)) -Cancel:y) 
+DisplayStartDate:[mkwsw2jk0
-mq3dj1uq0] +EndDate:[mq3dj1uq0-ntlxuggw0]

This query counted 34000 documents and I limited the returned documents
to 5.

This is using Peter Halacsy's IndexSearcherCache slightly modified to
be a singleton returned cached searchers for a given directory. This
solved an initial problem with too many files open and running out of
linux handles for them.

Threads|Avg Time per query (ms)
1       1009ms
2       2043ms
3       3087ms
4       4045ms
.	.
.	.
10      10091ms

I removed the two date range terms from the query and it made a HUGE
difference in performance. With 4 threads the avg time dropped to 900ms!

Other query optimizations made little difference.
</benchmark>



--
To unsubscribe, e-mail:   <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>


Mime
View raw message