lucene-dev mailing list archives

From Doron Cohen <>
Subject potential indexing performance improvement for compound index - cut IO - have more files though
Date Fri, 15 Dec 2006 07:31:04 GMT


I would like to propose and get feedback on a potential indexing
performance improvement for the case that compound file is used (this is
the default).

In compound segment mode, each merge operation ends by writing a compound
file. To be more precise, the merge result is first written to the
directory as non-compound segment files, and then it is 'converted' into a
compound segment file. This conversion involves reading the entire (non
compound) segment, and writing it again as a compound segment file. This
means that compound mode indexing does twice as much index writing compared
to non-compound mode (and there's also the extra reading of the
non-compound segment).
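The conversion step above is a full read-and-rewrite of the just-merged segment. A minimal sketch of that extra IO (illustrative only - this is not Lucene's actual CompoundFileWriter, which also writes a table of entry offsets; names here are made up):

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

public class CompoundSketch {
    // Pack the per-segment files into one compound file.
    // Every byte of the freshly merged segment is read back and written again,
    // which is the doubled IO cost discussed above.
    public static long packCompound(Path dir, List<String> segmentFiles, Path compound)
            throws IOException {
        long bytesCopied = 0;
        try (OutputStream out = new BufferedOutputStream(Files.newOutputStream(compound))) {
            for (String name : segmentFiles) {
                byte[] data = Files.readAllBytes(dir.resolve(name)); // extra read pass
                out.write(data);                                     // extra write pass
                bytesCopied += data.length;
            }
        }
        return bytesCopied; // total bytes read and written a second time
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("seg");
        Files.write(dir.resolve("_1.fnm"), new byte[5]);
        Files.write(dir.resolve("_1.fdt"), new byte[7]);
        long copied = packCompound(dir, Arrays.asList("_1.fnm", "_1.fdt"),
                dir.resolve("_1.cfs"));
        System.out.println(copied); // 12
    }
}
```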

The reason for this two-step process in writing compound segment files is
that the per-segment files cannot be written sequentially, one by one -
several files are created together, written interleaved.

But I think that there is an intermediate state - between
one-compound-segment-file and non-compound-many-files.

To my understanding, at merge time, the following apply:
- .fnm - field infos - independent of other files.
- .fdx .fdt - stored fields - interleaved with each other, independent of
other files.
- .tis .tii .frq .prx - dictionary and postings - interleaved with each
other, independent of other files.
- .tvx .tvd .tvf - term vectors - interleaved with each other, independent
of other files.
- .fN - norms - all these files written sequentially, independent of other
files.

Therefore, a "semi compound" segment file format can be defined that would
be made of 4 files (instead of 1):
- File 0: .fdx .tis .tvx
- File 1: .fdt .tii .tvd
- File 2: .frq .tvf
- File 3: .fnm .prx .fN

A merge should be able to write this segment representation at once - no
need to read and write it again.
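The grouping above works because any two files that are written interleaved land in different semi-compound files, so each output file can still be appended to sequentially during the merge. A small check of that property (layout and names are illustrative, not a Lucene API; the .fN norms are left out of the interleaving constraint since they are written one after another):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SemiCompoundCheck {
    // Interleaved groups from the analysis above: files written concurrently at merge time.
    static final List<Set<String>> INTERLEAVED = Arrays.asList(
        new HashSet<>(Arrays.asList(".fdx", ".fdt")),
        new HashSet<>(Arrays.asList(".tis", ".tii", ".frq", ".prx")),
        new HashSet<>(Arrays.asList(".tvx", ".tvd", ".tvf")));

    // Proposed assignment of extensions to the 4 semi-compound files.
    static final List<Set<String>> LAYOUT = Arrays.asList(
        new HashSet<>(Arrays.asList(".fdx", ".tis", ".tvx")),
        new HashSet<>(Arrays.asList(".fdt", ".tii", ".tvd")),
        new HashSet<>(Arrays.asList(".frq", ".tvf")),
        new HashSet<>(Arrays.asList(".fnm", ".prx")));

    // Returns true iff no semi-compound file holds two members of one interleaved group,
    // i.e. every output file can be written strictly sequentially.
    public static boolean valid() {
        for (Set<String> group : INTERLEAVED) {
            for (Set<String> file : LAYOUT) {
                int shared = 0;
                for (String ext : group) {
                    if (file.contains(ext)) shared++;
                }
                if (shared > 1) return false; // two interleaved streams in one sequential file
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(valid()); // true
    }
}
```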

A few questions:
(1) is this correct at all, or have I overlooked something?
(2) what performance gain would that buy?
(3) is it reasonable to have 4 files per segment compared to 1 file per
segment?

For (2), the indexing performance of non-compound mode is an upper bound. I
compared the indexing speeds of compound and non-compound modes, using the
Reuters input set. Tried with stored fields + term vectors, and without
stored fields:

 round  vect  stor cmpnd   runCnt   recsPerRun        rec/s  elapsedSec
     0  true  true  true        1        21578        150.2      143.69
     1  true  true false        1        21578        178.9      120.58
     2 false false  true        1        21578        164.7      131.03
     3 false false false        1        21578        184.3      117.07

This is a 19% speed-up with stored+vectors, and a 12% speed-up with no
stored fields.
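The speed-up figures follow directly from the rec/s column of the table above (hypothetical helper name, just the arithmetic):

```java
public class SpeedupCalc {
    // Percent speed-up of the faster rate over the baseline rate.
    public static double speedupPct(double base, double faster) {
        return (faster / base - 1) * 100;
    }

    public static void main(String[] args) {
        // rec/s figures from the benchmark table: compound vs. non-compound
        System.out.printf("%.1f%%%n", speedupPct(150.2, 178.9)); // 19.1% (stored+vectors)
        System.out.printf("%.1f%%%n", speedupPct(164.7, 184.3)); // 11.9% (no stored fields)
    }
}
```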

As a side comment, it says something about IO vs. CPU in Lucene indexing
that cutting (I think) half of the file output speeds things up by less
than 20%.

But anyhow, this is not a negligible difference, and for really large
indexes and busy systems, where the just-written non-compound segment is
not in the system caches, it might have even more effect. Possibly, search
performance during indexing would also improve due to the reduced indexing
IO. Also, the delay for an addDocument() call that triggers a merge should
become smaller.

Thanks for your comments, also (but not only) on (1) and (3) above.
