From: Dmitry Serebrennikov
Date: Thu, 17 Jan 2002 12:18:48 -0700
To: Lucene Developers List
Subject: Re: File Handle usage of Lucene IndexReader.

Snyder, David wrote:

> I am trying to understand the file handle usage of an IndexSearcher. Let
> me explain a little about our application: We generally employ many
> indexes that are updated on a nightly basis. When updating, we both add
> and delete documents. We only optimize on a weekly basis. Incidentally,
> we run on Solaris and the Sun JDK 1.3, and our updates are done in a
> separate process from our searches. It is the search file handle usage
> that is the issue.

Since no one has picked this up yet, let me give it a start. We also had
issues with file handles, although so far adjusting the OS limit has solved
our problem (see the /etc/system file, and the rlim_fd_max and rlim_fd_cur
settings). But I can see how, with many indexes, this would not help.

Lucene indexes are stored in segments, where each segment is composed of a
number of files. Each segment can contain anywhere from one document to all
of the documents in the index. When documents are added to the index, new
segments are created for those documents. This allows segments to be
read-only, since existing segments are never updated (save for marking
documents as deleted). Each new document is given a new segment! But since
the segments are first created in a RAMDirectory and then flushed into the
FSDirectory, the new segments on disk hold 10 documents each. The number 10
is hard-coded, I believe.

The segments are identified by a common file name; the files within a
segment are identified by their extensions. The file "segments" is the
control file that names all segments participating in a given index. The
"optimization" operation merges all existing segments into one.

Searching is done by opening all segments on disk, evaluating the query
against each of the segments, and then merging the results. This is pretty
efficient, but there is some penalty for searching against multiple
segments rather than optimizing and searching against a single segment.

Ok, now for the file handles. Each file in each segment is opened and kept
open for as long as an IndexReader is open. This, of course, eats a file
handle for each file.
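If you want to see where a given index stands, counting the files in its
directory is a fair estimate, since an open IndexReader holds all of them
open. A quick throwaway sketch, nothing more (the path is made up):

    import java.io.File;

    /** Rough estimate of the handles one open IndexReader will hold
     *  for an index directory: about one per file. The path below is
     *  made up -- point it at your own index. */
    public class HandleEstimate {
        public static void main(String[] args) {
            File indexDir = new File("/data/index");
            String[] files = indexDir.list();
            if (files == null) {
                System.err.println("Not a directory: " + indexDir);
                return;
            }
            // Every segment file stays open for the life of the reader,
            // so the file count approximates the handles consumed.
            System.out.println(files.length + " files in " + indexDir
                + " => roughly " + files.length
                + " handles per open reader");
        }
    }

Run it against each of your index directories and add the results up (plus
whatever else the process has open) to see how close you are to the limit.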
Segments have a varying number of files, depending on how many stored
fields exist in the documents, but it is at least 8 files or so. So if you
have 10 unmerged segments, this is 80 file handles (plus the "segments"
file and the deletions file). 10 indexes with 100 unmerged segments each
will be pushing 8000 file handles, which is as far as anyone dares to jack
up the Solaris limit.

> Due to the large number of indexes involved in some of our searches, we
> are maxing out on the number of file handles available to the process
> running Lucene. It seems that with each update, the number of file
> handles an IndexSearcher requires balloons. This effect, multiplied by
> our many indexes, runs us into the OS limit.

Right. One solution is to optimize more frequently. You can also optimize
on the secondary systems that do your indexing and then move the files onto
the main system. Or you can use the API that does the merging to merge
segments from one directory (where they were prepared and optimized) into
your main working directory. However, even with one segment per index, the
number of file handles is still large, especially if you use many indexes
and if your application has to open files for other reasons (socket
descriptors for serving web requests, for example).

Which brings me to a proposal: what do people think of changing the
optimization process (or adding a secondary optimization step) to create a
single file that would contain all of the information needed for a segment?
Since segments are read-only, this shouldn't cause any problems, right?
Then Lucene could allocate a pool of file handles that would be dynamically
assigned to this segment file and shared by the IndexReader code, just as
it already does for multiple read positions on a single file. In this
design, applications could choose how many file handles to allocate to
Lucene: from one per segment to N per segment. With one per segment, the
performance would likely be affected, but at least it would be up to the
application to decide. Any reactions?
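To make the proposal concrete, here is a rough, untested sketch of the kind
of handle pool I have in mind. Nothing like this exists in Lucene today;
all of the names are invented:

    import java.io.File;
    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.Stack;

    /** Hypothetical sketch only: a capped pool of read-only handles
     *  onto a single-file segment. All names are invented. */
    class SegmentHandlePool {
        private final File segmentFile;  // the one file holding the segment
        private final int maxHandles;    // cap chosen by the application
        private final Stack idle = new Stack(); // handles not in use
        private int opened = 0;

        SegmentHandlePool(File segmentFile, int maxHandles) {
            this.segmentFile = segmentFile;
            this.maxHandles = maxHandles;
        }

        /** Borrow a handle, opening a new one only while under the cap. */
        synchronized RandomAccessFile acquire()
                throws IOException, InterruptedException {
            while (idle.isEmpty() && opened >= maxHandles) {
                wait();            // all handles busy; wait for a release
            }
            if (!idle.isEmpty()) {
                return (RandomAccessFile) idle.pop();
            }
            opened++;
            return new RandomAccessFile(segmentFile, "r");
        }

        /** Return a handle so another reader can seek and read. */
        synchronized void release(RandomAccessFile handle) {
            idle.push(handle);
            notify();
        }
    }

A reader would acquire() a handle, seek to whatever position it needs,
read, and release() the handle, much as multiple read positions already
share a single file today. With maxHandles at 1 you get the minimum handle
usage at some cost in concurrency; with a larger cap, concurrent query
threads would rarely block. Again, just a sketch to show the shape of the
thing.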
> A couple of specific questions I had:
>
> 1) Does an IndexSearcher need a file handle to each file in the index
> directory?

Yes.

> 2) Does an IndexSearcher ever close files during its lifetime? (before
> being closed)

No, as far as I know. I think every file is needed for the evaluation of
any query, so the only time to close the files would be during a period of
inactivity when no queries were coming in. If an application can detect
such a period, it can close the IndexReader/Searcher, but I imagine that
would be difficult to detect. Well, the field files could be closed as long
as no one retrieves document bodies for a while. Also, if these files were
opened on demand, and there are index segments from which hits are seldom
found, this could help reduce the number of handles. And if one segment's
field files were closed before another's were opened, this could also help,
although it would probably slow down document retrieval somewhat. Is this a
worthwhile enhancement to Lucene?

> 3) Is there any reason why version 1.2 would use significantly more file
> handles than version 1.0? (I'm aware of the extra locking files)

Not sure. It should use fewer, because somewhere between the two versions
IndexReader was changed to use one handle per file instead of one handle
per query term (!).

> 4) Would the fact that we are using indexes created with the version 1.0
> code base but updating with version 1.2 make any difference? Are indexes
> compatible between the two versions?

As far as I know, yes, they are compatible. It shouldn't have any effect.
It's the lack of optimization that is likely killing you.

> Any help explaining how this works would be appreciated... I struggled
> over whether to post this to the user group vs. the dev group. It seemed
> like the level of question I was asking is more appropriate for this
> list. I apologize if I got it wrong.

Seems appropriate to me. Anyway, I was going to propose the merging of
files one way or another.

Best regards.
Dmitry.
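P.S. For anyone else raising the Solaris descriptor limit I mentioned at
the top, the /etc/system entries look like this (the numbers are just
examples, pick values that fit your box; a reboot is needed for them to
take effect):

    set rlim_fd_max = 8192
    set rlim_fd_cur = 4096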