From: Dmitry Serebrennikov
Date: Thu, 17 Jan 2002 12:18:48 -0700
To: Lucene Developers List
Subject: Re: File Handle usage of Lucene IndexReader.

Snyder, David wrote:

> I am trying to understand the file handle usage of an IndexSearcher. Let
> me explain a little about our application: We generally employ many
> indexes that are updated on a nightly basis. When updating, we both add
> and delete documents. We only optimize on a weekly basis. Incidentally,
> we run on Solaris and the Sun JDK 1.3, and our updates are done in a
> separate process from our searches. It is the search file handle usage
> that is the issue.

Since no one has picked this up yet, let me give it a start. We also had
issues with file handles, although so far adjusting the OS limit has solved
our problem (see the /etc/system file, and the rlim_fd_max and rlim_fd_cur
settings). But I can see how, with many indexes, this would not help.

Lucene indexes are stored in segments, where each segment is composed of a
number of files. Each segment can contain anywhere from one document to all
of the documents in the index. When documents are added to the index, new
segments are created for those documents. This allows segments to be
read-only, since existing segments are never updated (save for marking
documents as deleted). Each new document is given a new segment! But since
the segments are first created in a RAMDirectory and then flushed into the
FSDirectory, the new segments on disk hold 10 documents each. The number 10
is hard-coded, I believe.

The segments are identified by a common file name; the files within a
segment are identified by their extensions. The file "segments" is the
control file that names all segments participating in a given index. The
"optimization" operation merges all existing segments into one.

Searching is done by opening all segments on disk, evaluating the query
against each of the segments, and then merging the results. This is pretty
efficient, but there is some penalty for searching against multiple
segments rather than optimizing and searching against a single segment.

Ok, now for the file handles. Each file in each segment is opened and kept
open for as long as an IndexReader is open. This, of course, eats a file
handle for each file.
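If you want to see where a given index stands, counting the files in its
directory is a fair estimate, since an open IndexReader holds all of them
open. A quick throwaway sketch, nothing more (the path is made up):

    import java.io.File;

    /** Rough estimate of the handles one open IndexReader will hold
     *  for an index directory: about one per file. The path below is
     *  made up -- point it at your own index. */
    public class HandleEstimate {
        public static void main(String[] args) {
            File indexDir = new File("/data/index");
            String[] files = indexDir.list();
            if (files == null) {
                System.err.println("Not a directory: " + indexDir);
                return;
            }
            // Every segment file stays open for the life of the reader,
            // so the file count approximates the handles consumed.
            System.out.println(files.length + " files in " + indexDir
                + " => roughly " + files.length
                + " handles per open reader");
        }
    }

Run it against each of your index directories and add the results up (plus
whatever else the process has open) to see how close you are to the limit.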
Segments have a varying number of files, depending on how many stored
fields exist in the documents, but it is at least 8 files or so. So if you
have 10 unmerged segments, this is 80 file handles (plus the "segments"
file and the deletions file). 10 indexes with 100 unmerged segments each
will be pushing 8000 file handles, which is as far as anyone dares to jack
up the Solaris limit.

> Due to the large number of indexes involved in some of our searches, we
> are maxing out on the number of file handles available to the process
> running Lucene. It seems that with each update, the number of file
> handles an IndexSearcher requires balloons. This effect, multiplied by
> our many indexes, runs us into the OS limit.

Right. One solution is to optimize more frequently. You can also optimize
on the secondary systems that do your indexing and then move the files onto
the main system. Or you can use the API that does the merging to merge
segments from one directory (where they were prepared and optimized) into
your main working directory. However, even with one segment per index, the
number of file handles is still large, especially if you use many indexes
and if your application has to open files for other reasons (socket
descriptors for serving web requests, for example).

Which brings me to a proposal: what do people think of changing the
optimization process (or adding a secondary optimization step) to create a
single file that would contain all of the information needed for a segment?
Since segments are read-only, this shouldn't cause any problems, right?
Then Lucene could allocate a pool of file handles that would be dynamically
assigned to this segment file and shared by the IndexReader code, just as
it already does for multiple read positions on a single file. In this
design, applications could choose how many file handles to allocate to
Lucene: from one per segment to N per segment. With one per segment, the
performance would likely be affected, but at least it would be up to the
application to decide. Any reactions?
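To make the proposal concrete, here is a rough, untested sketch of the kind
of handle pool I have in mind. Nothing like this exists in Lucene today;
all of the names are invented:

    import java.io.File;
    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.Stack;

    /** Hypothetical sketch only: a capped pool of read-only handles
     *  onto a single-file segment. All names are invented. */
    class SegmentHandlePool {
        private final File segmentFile;  // the one file holding the segment
        private final int maxHandles;    // cap chosen by the application
        private final Stack idle = new Stack(); // handles not in use
        private int opened = 0;

        SegmentHandlePool(File segmentFile, int maxHandles) {
            this.segmentFile = segmentFile;
            this.maxHandles = maxHandles;
        }

        /** Borrow a handle, opening a new one only while under the cap. */
        synchronized RandomAccessFile acquire()
                throws IOException, InterruptedException {
            while (idle.isEmpty() && opened >= maxHandles) {
                wait();            // all handles busy; wait for a release
            }
            if (!idle.isEmpty()) {
                return (RandomAccessFile) idle.pop();
            }
            opened++;
            return new RandomAccessFile(segmentFile, "r");
        }

        /** Return a handle so another reader can seek and read. */
        synchronized void release(RandomAccessFile handle) {
            idle.push(handle);
            notify();
        }
    }

A reader would acquire() a handle, seek to whatever position it needs,
read, and release() the handle, much as multiple read positions already
share a single file today. With maxHandles at 1 you get the minimum handle
usage at some cost in concurrency; with a larger cap, concurrent query
threads would rarely block. Again, just a sketch to show the shape of the
thing.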
> A couple of specific questions I had:
>
> 1) Does an IndexSearcher need a file handle to each file in the index
> directory?

Yes.

> 2) Does an IndexSearcher ever close files during its lifetime? (before
> being closed)

No, as far as I know. I think every file is needed for the evaluation of
any query, so the only time to close the files would be during a period of
inactivity when no queries were coming in. If an application can detect
such a period, it can close the IndexReader/Searcher, but I imagine that
would be difficult to detect. Well, the field files could be closed as long
as no one retrieves document bodies for a while. Also, if these files were
opened on demand, and there are index segments from which hits are seldom
found, this could help reduce the number of handles. And if one segment's
field files were closed before another's were opened, this could also help,
although it would probably slow down document retrieval somewhat. Is this a
worthwhile enhancement to Lucene?

> 3) Is there any reason why version 1.2 would use significantly more file
> handles than version 1.0? (I'm aware of the extra locking files)

Not sure. It should use fewer, because somewhere between the two versions
IndexReader was changed to use one handle per file instead of one handle
per query term (!).

> 4) Would the fact that we are using indexes created with the version 1.0
> code base but updating with version 1.2 make any difference? Are indexes
> compatible between the two versions?

As far as I know, yes, they are compatible. It shouldn't have any effect.
It's the lack of optimization that is likely killing you.

> Any help explaining how this works would be appreciated... I struggled
> over whether to post this to the user group vs. the dev group. It seemed
> like the level of question I was asking is more appropriate for this
> list. I apologize if I got it wrong.

Seems appropriate to me. Anyway, I was going to propose the merging of
files one way or another.

Best regards.
Dmitry.
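P.S. For anyone else raising the Solaris descriptor limit I mentioned at
the top, the /etc/system entries look like this (the numbers are just
examples, pick values that fit your box; a reboot is needed for them to
take effect):

    set rlim_fd_max = 8192
    set rlim_fd_cur = 4096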