lucene-dev mailing list archives

From Doug Cutting <>
Subject Re: idea for reducing file handle use
Date Thu, 18 Sep 2003 20:26:19 GMT

It would be cleaner if this could be done entirely as a Directory 
implementation.  I know some folks who've implemented a 
filesystem-within-a-file solution for this problem that they're very 
happy with.  It is a Directory, and requires no changes to Lucene.  I'll 
ask them if they're willing to contribute it, so that others can use it.


Dmitry Serebrennikov wrote:
> Greetings, Luceeners!
> Looks like lots of good stuff is happening with the code as of late. 
> It's great to see this momentum!
> Here's some more action coming your way...
> ---------------------------------------------------------
> We all love Lucene, but most would agree that it tends to use a very 
> large number of file handles.
> This is especially true for applications that have one or more of the 
> factors below:
>    a) use a merged index over a large number of indexes
>    b) experience index updates concurrent with searching
>    c) search through unoptimized indexes
>    d) use high merge factor settings to speed up indexing
>    e) have a large number of indexed fields
> For a long time I've been contemplating an idea that can help 
> drastically reduce the number of file handles needed by Lucene. Now I am 
> finally going to get a few days to make this happen (pending final 
> approval by the powers that be). So, I wanted to put out the general 
> plan of action and seek community comment on it early on. Over the next 
> day or so, I intend to implement the changes outlined below (unless of 
> course I get responses that steer me in a different direction). As I get 
> more solid results (down to the diffs), I'll post them for further 
> review. But the sooner I get feedback, the more chance there is that I 
> will actually be able to incorporate it. Hopefully, this will result in 
> a set of patches that will solve the problems I am after, and be useful 
> enough to the general Lucene population to be included into the tree.
> So, here goes.
> Lucene's indexes are built out of segments. Each segment consists of a 
> number of files, which are written when the segment is created during 
> indexing. Once the IndexWriter is closed, the segment files are not 
> modified, ever (except the file that lists deleted documents, if any). 
> The proposed change is as follows:
>    - add code to IndexWriter.close() method to combine all of the 
> segment's files into a single file with a header that indicates start 
> offset and a length for each of the new file's components, corresponding 
> to individual files in the current segments. This will be done in such a 
> way that the file will be able to contain any number of components - 
> this way we can support evolution of the segment structure in the 
> future. The deleted documents file will remain separate.
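[Editor's note: a minimal sketch of the combining step described above. The class and method names are hypothetical, and the header layout (entry count, then per-component name/offset/length, then the raw bytes) is an assumption for illustration, not Lucene's actual on-disk format.]

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical sketch: pack a segment's files into one compound file.
// Assumed header layout: an int entry count, then for each component its
// name, start offset, and length; the component bytes follow the header.
public class CompoundFilePackSketch {
    public static void pack(File out, File[] components) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(out, "rw")) {
            raf.writeInt(components.length);
            long[] offsetSlots = new long[components.length];
            // Write header entries; offsets are patched after the data lands.
            for (int i = 0; i < components.length; i++) {
                raf.writeUTF(components[i].getName());
                offsetSlots[i] = raf.getFilePointer();
                raf.writeLong(0L);                     // offset, patched below
                raf.writeLong(components[i].length()); // length
            }
            // Append each component's bytes and record where each one starts.
            for (int i = 0; i < components.length; i++) {
                long start = raf.getFilePointer();
                try (RandomAccessFile in =
                         new RandomAccessFile(components[i], "r")) {
                    byte[] buf = new byte[8192];
                    int n;
                    while ((n = in.read(buf)) != -1) raf.write(buf, 0, n);
                }
                long end = raf.getFilePointer();
                raf.seek(offsetSlots[i]);
                raf.writeLong(start); // patch the reserved offset slot
                raf.seek(end);
            }
        }
    }
}
```

Because the header carries a count plus self-describing name/offset/length entries, the file can hold any number of components, which is what allows the segment structure to evolve later.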
>    - add a new segment reader, or add code to the existing one, to work 
> with these types of segments
>    - when this new segment reader opens its files, it can open one file 
> object from the Directory for the combined file and then clone it for 
> each of the files formerly in the segment. Each cloned file object would 
> maintain its own position into the combined file and will have its own 
> buffer as they did before. They will also need to know the starting 
> offset and a length of their fragment of the combined file.
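[Editor's note: the "clone per former file" idea can be sketched as below. All names here are illustrative, not Lucene's API: each slice shares the one underlying handle but keeps a private position, so reads on one slice never disturb another (a real implementation would also keep a per-slice buffer rather than seeking on every read).]

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical sketch: a view over one region of a compound file.
// The RandomAccessFile handle is shared; the position is per-slice.
public class SlicedInput {
    private final RandomAccessFile file; // shared handle, one per compound file
    private final long start;            // slice's offset in the compound file
    private final long length;           // slice's length
    private long pos = 0;                // this slice's private position

    public SlicedInput(RandomAccessFile file, long start, long length) {
        this.file = file;
        this.start = start;
        this.length = length;
    }

    // Read one byte at this slice's position, seeking the shared handle
    // immediately before the read; returns -1 past the end of the slice.
    public synchronized int readByte() throws IOException {
        if (pos >= length) return -1;
        file.seek(start + pos);
        pos++;
        return file.read();
    }

    public void seek(long p) { pos = p; }

    // Clone a sub-region with its own independent position.
    public SlicedInput slice(long off, long len) {
        return new SlicedInput(file, start + off, len);
    }
}
```

With this shape, opening a segment costs one OS file handle for the compound file (plus the deletions file), however many component files the segment formerly had.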
> Some questions to solicit feedback:
> *) I don't know all of the classes that will need changes yet, but I 
> think this can be accomplished with moderate effort in the index and 
> maybe the store packages. Does this seem reasonable?
> *) I can't see any adverse effects of this change except possibly one. 
> Since less OS file handles will be used, the way OS caching is applied 
> to Lucene indexes will change. I know that Lucene relies on OS-level 
> file caching for a good part of its performance magic, but I lack the 
> right experience to know what effect the proposed change will have on 
> the performance. There should be the same number of disk accesses 
> overall, but they will now be concentrated in a single file rather than 
> spread across many. The disk should not really thrash any more than 
> before, since the same data will be read in the same order, just now it 
> will be in a single file rather than in different files. However, if OS 
> file caching is optimal only when a given file handle experiences 
> sequential reads, this can be a problem. Can anyone shed some light on 
> what we can expect with this change? I am primarily interested in 
> Solaris and Windows (NT/2000) at this time, but I'd like to know of 
> possible impact on other OSes as well.
> *) Given the above, is this a worthwhile idea? If not, can we modify it 
> so as to limit the performance impact?
> Thanks for your consideration and feedback.
> Dmitry.
