lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dmitry Serebrennikov <>
Subject RE: Performance: compound vs. multi-file index, indexing and searching
Date Sun, 13 Jun 2004 18:16:24 GMT
Hui, thanks for running the numbers and providing this to the list!

The results are very interesting. Of course, the primary purpose of the 
compound file format is to reduce the use of filehandles (at the expense 
of performance, and especially during indexing). I put your data into 
Excel to try to summarize it. Here's what I found:

Traditional, min merge docs 10:  1458110
Compound, min merge docs 10:   1619580 (+11%)
Traditional, min merge docs 100:   1321379 (-9%)

I wonder what the compound at 100 would have looked like?

Indexes  Trad 10  Comp 10  Trad 100
0        37.8        46.8      40.6
1        12.4        15.6      12.6
2        28.4        31.4      37.6
3        18.6        21.8      31.2
4        31.0        28.4      28.4
5        21.8        22        34.4
6        31.0        25.2      31.4
7        37.4        28.6      34.4
8        37.2        34.4      34.2
9        37.4        34.4      34.8
10       40.6        50.4      40.8
11       56.2        40.6      47
12       59.6        43.4
13       43.8        46.8
14       40.6        34.6
15       47.0        37.4
16       37.4 
17       50.2 
18       37.4 
19       41 

I'm also attaching a graph of this data, but not sure if it will come 
through on the mailing list. I think there is too much variation in data 
here to say anything interesting. The numbers above are averages across 
all of your queries (regardless of how many hits they brough back), for 
a given number of "open indexes". I'm assuming that this is the number 
of indexes opened by the merged searcher. I think most of the 
irregularity comes from ocasional GC activity.
Perhaps if you could run with -verbose:gc option and then subtract the 
time in GC from the search times, we might get a more stable set of 
results. That would show the true difference of the code speed. You 
could also then graph the amount of memory collected by the GC or the 
time spent in GC to see if compound files cause more or less GC.

Could you please tell the list what kind of VM, hardware, OS was used in 
this test? Also, was there just one thread that was searching in all 
these cases? I found that these types of tests vary widely on different 
OSes and hardware. In my experience, Solaris/Sparc/SCSI gives slower 
response but is much more consistent in how it scales with the increase 
in the number of concurrent searches. Whereas a Windows/Intel/IDE can be 
dramatically faster on lower workloads, but starts to become erratic as 
the number of concurrent searches goes up.

I'm going to also attach a Python script that can help analyzing GC 
output of Sun's 1.3 VMs. One problem with those VMs is that they do not 
print when a given GC occured. This makes it hard to analyze. It is 
possible, however, to have a thread that sleeps for 1 second and then 
prints "[TS 234234234]" to the System.out stream, where the number of 
the value of System.currentTimeMillis(). The script I'm including will 
parse that kind of data and produce a comma-separated output of GC stats 
that can be graphed more easily.

Hope you guys find the above useful.
Good luck!

View raw message