Message-ID: <20020507173542.32306.qmail@mailshell.com>
Subject: RE: FileNotFoundException: Too many open files
Date: Tue, 7 May 2002 10:34:43 -0700
From: apache@lucene.com
To: lucene-user@jakarta.apache.org
List-Id: "Lucene Users List"

Thanks, Dmitry.  Here's a little more detail:

> From: Dmitry Serebrennikov
>
> The index directory has the following files:
> deletable - one, lists segment ids that can be deleted when no
>    longer locked by the filesystem because they are open
> segments - one, lists segment ids of the current set
>    of segments
> _<seg>.tii - one per segment, "term index" file

This is the term infos index file.  It contains every 128th entry from
the "tis" file, along with its location in the "tis" file.  This is
read entirely into memory and is used to provide random access to the
"tis" file.

> _<seg>.tis - one per segment, "term infos" file

This is the term infos file.
Its logical format is <t, df, freqLoc, proxLoc>*, where t is the term,
df is the "document frequency", or count of documents containing t,
freqLoc is the location of t's data in the "frq" file, and proxLoc is
the location of t's data in the "prx" file.

> _<seg>.frq - one per segment, "term frequency" file

This is the frequency file.  It contains the frequency of each term in
each document.  Its logical format is <<d, f>*>*, where d is a document
number and f is the number of times the term occurred in that document.
The TermDocs interface is used to access this data.

> _<seg>.prx - one per segment, "term positions" file

This is the proximity file.  It contains the positions of each term in
each document.  Its logical format is <<p>*>*, where p is an ordinal
position of a term.  The TermPositions interface is used to access this
data.

> _<seg>.fdx - one per segment, "field index" file

This is the field index file.  It contains the location of each
document's stored fields in the "fdt" file.  Its logical format is
<docLoc_i>*, where docLoc_i is the location in the "fdt" file of
document i.  This is read entirely into memory and is used to provide
random access to a document's stored fields.

> _<seg>.fdt - one per segment, "field infos" file

This is the field data file.  It contains each document's stored
fields.  Its logical format is <*>*.

> _<seg>.fnm - one per segment, "field infos" file

This is the field info file.  It contains the names of the fields.

> _<seg>.f<n> - one per segment per stored field, "field data" file

These are the normalization files.  They contain one byte for each
field in each document that is multiplied into the score of hits on
that field of that document.

> <seg> - is the segment number, encoded using numbers and letters
> <n> - is the field number, which is a unique field id in that segment.
> An index should have 2 + n * (7 + m) files, where n is the number of
> segments and m is the number of stored fields. For an optimized index
> with one stored field this gives 10 files (not a 100!).

The maximum number of segments an unoptimized index can have is:

  (m-1) * (log_m(n)-1)

where m is the mergeFactor, 10 by default, and n is the number of
documents added since the index was last optimized.  The average
number of segments is about half that.

So a ~1M document index that is never optimized can have, at most,
9 * (6-1) = 45 segments.  If you optimize every 10k documents, then you
can limit things to 9 * (4-1) = 27 segments.  Or you can manage things
more explicitly with tools like RAMDirectory and
IndexWriter.addIndexes().

Doug
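
As a rough illustration (not part of the original message), the two
formulas in this thread can be evaluated with a few lines of Python.
The function names expected_file_count and max_segments are made up
for this sketch.  The first is the file-count estimate quoted from
Dmitry, the second is the segment bound given above.  Feeding the
second into the first to get a worst-case file count is an assumption
of the sketch, not something the thread states.

```python
import math


def expected_file_count(segments: int, stored_fields: int) -> int:
    """Quoted estimate: 2 + n * (7 + m) files for n segments and m stored fields."""
    return 2 + segments * (7 + stored_fields)


def max_segments(docs_added: int, merge_factor: int = 10) -> float:
    """Segment bound from the message: (m - 1) * (log_m(n) - 1).

    m is the mergeFactor and n is the number of documents added since
    the last optimize.  The bound is only meaningful when n >= m.
    """
    return (merge_factor - 1) * (math.log(docs_added, merge_factor) - 1)


if __name__ == "__main__":
    # Optimized index (one segment) with one stored field.
    print(expected_file_count(1, 1))

    # ~1M documents, never optimized, default mergeFactor of 10.
    worst = round(max_segments(1_000_000))
    print(worst)

    # Assumption: worst-case file count if every segment exists at once.
    print(expected_file_count(worst, 1))
```

Under the quoted estimate, the worst case for a never-optimized ~1M
document index with one stored field comes to several hundred files,
which is the scale at which a per-process open-file limit could be
reached.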