lucene-dev mailing list archives

From Dmitry Serebrennikov <dmit...@earthlink.net>
Subject idea for reducing file handle use
Date Thu, 18 Sep 2003 19:04:08 GMT
Greetings, Luceeners!

Looks like lots of good stuff is happening with the code as of late. 
It's great to see this momentum!
Here's some more action coming your way...

---------------------------------------------------------
We all love Lucene, but most would agree that it tends to use a very 
large number of file handles.
This is especially true for applications that have one or more of the 
factors below:
    a) use a merged index over a large number of indexes
    b) experience index updates concurrent with searching
    c) search through unoptimized indexes
    d) use high merge factor settings to speed up indexing
    e) have a large number of indexed fields
For a long time I've been contemplating an idea that can help 
drastically reduce the number of file handles needed by Lucene. Now I am 
finally going to get a few days to make this happen (pending final 
approval by the powers that be). So, I wanted to put out the general 
plan of action and seek community comment on it early on. Over the next 
day or so, I intend to implement the changes outlined below (unless of 
course I get responses that steer me in a different direction). As I get 
more solid results (down to the diffs), I'll post them for further 
review. But the sooner I get feedback, the more chance there is that I 
will actually be able to incorporate it. Hopefully, this will result in 
a set of patches that will solve the problems I am after, and be useful 
enough to the general Lucene population to be included into the tree.

So, here goes.

Lucene's indexes are built out of segments. Each segment consists of a 
number of files, which are written when the segment is created during 
indexing. Once the IndexWriter is closed, the segment files are not 
modified, ever (except the file that lists deleted documents, if any). 
The proposed change is as follows:
    - add code to the IndexWriter.close() method to combine all of the 
segment's files into a single file with a header that gives the start 
offset and length of each of the new file's components, corresponding 
to the individual files in the current segments. This will be done in such a 
way that the file will be able to contain any number of components - 
this way we can support evolution of the segment structure in the 
future. The deleted documents file will remain separate.
    - add a new segment reader, or add code to the existing one, to work 
with these types of segments
    - when this new segment reader opens its files, it can open one file 
object from the Directory for the combined file and then clone it for 
each of the files formerly in the segment. Each cloned file object will 
maintain its own position in the combined file and will have its own 
buffer, as the separate file objects did before. Each will also need to 
know the starting offset and length of its fragment of the combined file.
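
To make the proposal above concrete, here is a rough sketch of the combined-file layout and the "cloned" readers. It is only an illustration under stated assumptions, not working Lucene code: the class and method names (CompoundSegmentSketch, combine, readHeader, Slice) and the ".cmb" extension are made up for this example, it uses java.io.RandomAccessFile directly rather than going through Directory, and it leaves out the per-clone buffering mentioned above.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these names exist in Lucene.
class CompoundSegmentSketch {

    /** Where one former segment file lives inside the combined file. */
    static final class Entry {
        final long offset, length;
        Entry(long offset, long length) { this.offset = offset; this.length = length; }
    }

    /** Writes a header (count, then name/offset/length per component), then the raw bytes. */
    static void combine(Path combined, List<Path> components) throws IOException {
        try (RandomAccessFile out = new RandomAccessFile(combined.toFile(), "rw")) {
            out.setLength(0);
            out.writeInt(components.size());
            long[] slots = new long[components.size()];
            for (int i = 0; i < slots.length; i++) {
                out.writeUTF(components.get(i).getFileName().toString());
                slots[i] = out.getFilePointer(); // patched below once the offset is known
                out.writeLong(0);
                out.writeLong(0);
            }
            for (int i = 0; i < slots.length; i++) {
                byte[] data = Files.readAllBytes(components.get(i));
                long start = out.getFilePointer();
                out.write(data);
                long end = out.getFilePointer();
                out.seek(slots[i]);              // go back and fill in this header entry
                out.writeLong(start);
                out.writeLong(data.length);
                out.seek(end);
            }
        }
    }

    /** Reads the header back into a name -> (offset, length) map. */
    static Map<String, Entry> readHeader(RandomAccessFile in) throws IOException {
        in.seek(0);
        int count = in.readInt();
        Map<String, Entry> entries = new LinkedHashMap<>();
        for (int i = 0; i < count; i++) {
            String name = in.readUTF();
            long offset = in.readLong();
            long length = in.readLong();
            entries.put(name, new Entry(offset, length));
        }
        return entries;
    }

    /** A "clone": shares the one OS file handle but keeps its own position and bounds. */
    static final class Slice {
        private final RandomAccessFile shared;
        private final Entry entry;
        private long pos; // relative to entry.offset

        Slice(RandomAccessFile shared, Entry entry) { this.shared = shared; this.entry = entry; }

        int read(byte[] buf, int off, int len) throws IOException {
            if (pos >= entry.length) return -1;          // end of this component
            int want = (int) Math.min(len, entry.length - pos);
            int got;
            synchronized (shared) {                      // seek + read must not interleave
                shared.seek(entry.offset + pos);
                got = shared.read(buf, off, want);
            }
            if (got > 0) pos += got;
            return got;
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("segment");
        Path fnm = Files.write(dir.resolve("_1.fnm"), "fields".getBytes(StandardCharsets.UTF_8));
        Path frq = Files.write(dir.resolve("_1.frq"), "postings".getBytes(StandardCharsets.UTF_8));
        Path combined = dir.resolve("_1.cmb");
        combine(combined, List.of(fnm, frq));
        try (RandomAccessFile in = new RandomAccessFile(combined.toFile(), "r")) {
            Map<String, Entry> header = readHeader(in);
            Slice frqSlice = new Slice(in, header.get("_1.frq"));
            byte[] buf = new byte[8];
            int n = frqSlice.read(buf, 0, buf.length);
            System.out.println(new String(buf, 0, n, StandardCharsets.UTF_8));
        }
    }
}
```

Because every Slice seeks the shared handle before each read, any number of them can be read in an interleaved fashion while only one OS file handle stays open per segment. The cost in this sketch is the lock around seek and read; a buffered clone would take that lock only when refilling its buffer.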

Some questions to solicit feedback:
*) I don't know all of the classes that will need changes yet, but I 
think this can be accomplished with moderate effort in the index and 
maybe the store packages. Does this seem reasonable?
*) I can't see any adverse effects of this change except possibly one. 
Since fewer OS file handles will be used, the way OS caching is applied 
to Lucene indexes will change. I know that Lucene relies on OS-level 
file caching for a good part of its performance magic, but I lack the 
right experience to know what effect the proposed change will have on 
the performance. There should be the same number of disk accesses 
overall, but obviously they will be concentrated in a single file and 
more spread out within it. The disk should not really thrash any more than 
before, since the same data will be read in the same order, just now it 
will be in a single file rather than in different files. However, if OS 
file caching is optimal only when a given file handle experiences 
sequential reads, this can be a problem. Can anyone shed some light on 
what we can expect with this change? I am primarily interested in 
Solaris and Windows (NT/2000) at this time, but I'd like to know of 
possible impact on other OSes as well.
*) Given the above, is this a worthwhile idea? If not, can we modify it 
so as to limit the performance impact?

Thanks for your consideration and feedback.
Dmitry.


