lucene-dev mailing list archives

From Otis Gospodnetic <>
Subject Re: File Handle usage of Lucene IndexReader.
Date Fri, 18 Jan 2002 16:11:26 GMT

> The "optimization" operation merges all existing segments into one.
> Searching is done by opening all segments on disk, evaluating the
> query against each of the segments, and then merging the results.
> This is pretty efficient, but there is some penalty for searching
> against multiple segments rather than optimizing and searching
> against a single segment.
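As a toy illustration of the search model described above (not Lucene's actual code; the class, the map-based "segment" representation, and the per-segment doc limit are all assumptions for the sketch), each segment is queried independently and the hits are merged, with per-segment doc ids remapped to a global space:

```java
import java.util.*;

// Toy sketch: a "segment" maps a term to the doc ids containing it.
// The query is evaluated against each segment and hits are merged.
public class MultiSegmentSearch {
    static List<Integer> search(List<Map<String, List<Integer>>> segments,
                                String term) {
        List<Integer> merged = new ArrayList<>();
        int docBase = 0;                       // offset doc ids per segment
        for (Map<String, List<Integer>> segment : segments) {
            for (int doc : segment.getOrDefault(term, List.of())) {
                merged.add(docBase + doc);     // remap to a global doc id
            }
            docBase += 100;                    // assume <= 100 docs/segment
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> seg1 = Map.of("lucene", List.of(0, 3));
        Map<String, List<Integer>> seg2 = Map.of("lucene", List.of(1));
        System.out.println(search(List.of(seg1, seg2), "lucene")); // [0, 3, 101]
    }
}
```

The merge step is where the "some penalty" comes from: every open segment must be consulted for every query.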

I don't have time to check this myself now (going to the airport in a
bit), but does it open the segments (sets of files) sequentially or in
parallel?  I'd guess it's the former, and if that is true then the
number of file handles shouldn't be that high at any one time, no?

> Ok, now for the file handles. Each file in each segment is opened and
> kept open for as long as an IndexReader is open. This, of course,
> eats a file handle for each file. Segments have a varying number of
> files, depending on how many stored fields exist in the documents,
> but it is at least 8 files or so. So if you have 10 unmerged
> segments, this is 80 file handles (plus the segments file and the
> deleted file). 10 indexes with 100 unmerged segments will be pushing
> 8000 file handles, which is as far as anyone dares to jack up the
> Solaris limit.
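The arithmetic in the quoted paragraph is just segments times files per segment; a one-line sketch of the estimate (the class and method names are made up for illustration, and the ~8 files/segment figure is the quoted estimate, ignoring the segments and deleted files):

```java
// Back-of-the-envelope handle count from the paragraph above:
// every open segment holds roughly 8 files open at once.
public class HandleEstimate {
    static int handles(int indexes, int segmentsPerIndex, int filesPerSegment) {
        return indexes * segmentsPerIndex * filesPerSegment;
    }

    public static void main(String[] args) {
        System.out.println(handles(1, 10, 8));   // 80   (10 unmerged segments)
        System.out.println(handles(10, 100, 8)); // 8000 (10 indexes x 100 segments)
    }
}
```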

What is the reason for not closing each segment after the search
against it is completed?  Performance reasons?

> Which brings me to a proposal: what do people think of changing the
> optimization process (or adding a secondary optimization step) to
> create a single file that would contain all of the information needed
> for a segment? Since segments are read-only, this shouldn't cause any
> problems, right? Then Lucene can allocate a pool of file handles that
> could be dynamically allocated to this segment file and then shared
> by the IndexReader code just as it already does for multiple read
> positions on a single file. In this design, applications can choose
> how many file handles to allocate to Lucene: from one per segment to
> N per segment. With one per segment, the performance would likely be
> affected but at least it would be up to the application to decide.
> Any reactions?
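One way the proposed application-sized pool could look is a semaphore that caps how many handles are in use at once; callers block when the pool is exhausted. This is only a sketch of the idea, not anything in Lucene, and all names here are hypothetical:

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Sketch of the proposal: the application decides how many file
// handles Lucene may hold, and a semaphore enforces that cap.
public class HandlePool {
    private final Semaphore permits;

    HandlePool(int maxHandles) {
        permits = new Semaphore(maxHandles);
    }

    // Run some file work while holding one handle permit.
    <T> T withHandle(Supplier<T> work) throws InterruptedException {
        permits.acquire();              // blocks when the pool is exhausted
        try {
            return work.get();
        } finally {
            permits.release();          // the handle returns to the pool
        }
    }

    int available() { return permits.availablePermits(); }

    public static void main(String[] args) throws InterruptedException {
        HandlePool pool = new HandlePool(2);
        String r = pool.withHandle(() -> "read segment data");
        System.out.println(r + ", free permits: " + pool.available());
    }
}
```

With a pool of one per segment, queries against different files of the same segment would serialize on the permit, which is the performance cost the proposal acknowledges.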

Yes (a reaction) :).
I'm about to start developing (well, only designing for now) an
application that will use Lucene and will have many small indices.
There will be more update (delete, re-add) operations than
First, I am wondering whether the issue with exhausting file
descriptors will occur in that scenario, too.
Second, if what you are suggesting can be done without affecting the
performance (much), I don't see any harm in it.  It sounds like an
improvement, after all.

> >2) Does an IndexSearcher ever close files during its lifetime?
> >(before being closed)
> >
> No, as far as I know. I think every file is needed for evaluation of
> any query, so the only time to close the files would be if there was
> a period of inactivity when no queries were coming in. If an
> application can detect such a period, it can close the
> IndexReader/Searcher, but I imagine that would be difficult to
> detect.
> Well, the field files could be closed as long as no one retrieves
> document bodies for a while. Also if these files are opened on
> demand, and there are index segments from which hits are found
> seldom, this could help reduce the number of handles. Also, if one
> segment's field files are closed before another one's are opened,
> this could also help, although it will probably slow down document
> retrieval somewhat. Is this a worthwhile enhancement to Lucene?
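The on-demand opening idea in the quote could be sketched as a lazy wrapper: the underlying resource (say, a field file) is opened only on first access and can be dropped again when the caller decides it has been idle. Again a hypothetical sketch, not Lucene code:

```java
import java.util.function.Supplier;

// Sketch of on-demand opening: the resource is opened on first use,
// reused afterwards, and can be released when considered idle.
public class LazyHandle<T> {
    private final Supplier<T> opener;
    private T resource;           // null until first access
    private int openCount;        // how many times we actually opened

    LazyHandle(Supplier<T> opener) { this.opener = opener; }

    synchronized T get() {
        if (resource == null) {   // open only when actually needed
            resource = opener.get();
            openCount++;
        }
        return resource;
    }

    synchronized void closeIfIdle() { resource = null; } // caller decides "idle"

    int timesOpened() { return openCount; }

    public static void main(String[] args) {
        LazyHandle<String> h = new LazyHandle<>(() -> "field-file");
        h.get();
        h.get();                             // second call reuses the handle
        System.out.println(h.timesOpened()); // 1
    }
}
```

Segments whose hits are rarely retrieved would simply never pay for the open, which is the handle saving the quoted paragraph is after; the cost is the reopen latency on the first retrieval after a close.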

I don't know enough about that aspect of Lucene to be able to answer
this question.


