lucene-dev mailing list archives

From Dmitry Serebrennikov <dmit...@earthlink.net>
Subject Re: file handle changes
Date Mon, 22 Sep 2003 23:15:48 GMT
Greetings again.

I've implemented the file handle reduction changes, roughly as proposed 
before. Here are the patches for your enjoyment! :)

------------------------------------------
SUMMARY:
The goal of this patch is to drastically reduce the number of file 
handles required by Lucene. This is achieved by reducing the number of 
files required by a single index segment from N to 1, where N depends on 
the number of indexed fields in the segment. Typically, one should see a 
drop in the number of file handles by an order of magnitude! It could 
even be greater for indexes that contain large numbers of indexed fields.

The best part is that to take advantage of this feature, one simply 
needs to call setUseCompoundFiles(true) on an IndexWriter before putting 
documents into it. Everything else is automatic!

------------------------------------------
DETAILS:
The proposed implementation adds a new property to the IndexWriter -- 
get/setUseCompoundFiles(boolean). This property defaults to false, which 
is the existing behavior prior to this patch. If the property is set to 
true, all segments created by this IndexWriter will be of the "compound 
file" format. Compound file segments have only one main file - <id>.cfs. 
Document deletions are handled as before -- if documents from this 
segment are deleted, a second file named <id>.del is created (I didn't 
change this code).
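
To make the single-file idea concrete, here is a minimal sketch of how several per-segment files could be packed into one container and read back individually. This is not the on-disk format used by the patch (the message does not describe it). The layout (an entry count, then a table of names and lengths, then the raw data) and the names CompoundFileSketch, pack and read are assumptions made for illustration only.

```java
import java.io.DataOutputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;

// Hypothetical illustration only; not the format implemented by the patch.
class CompoundFileSketch {

    /** Writes all sub-files into one container: count, (name, length) table, then data. */
    static void pack(Path container, Map<String, byte[]> subFiles) throws IOException {
        try (DataOutputStream out = new DataOutputStream(Files.newOutputStream(container))) {
            out.writeInt(subFiles.size());
            for (Map.Entry<String, byte[]> e : subFiles.entrySet()) {
                out.writeUTF(e.getKey());
                out.writeLong(e.getValue().length);
            }
            // Data follows the table, in the same order as the table entries.
            for (byte[] data : subFiles.values()) {
                out.write(data);
            }
        }
    }

    /** Reads one sub-file back using a single open handle on the container. */
    static byte[] read(Path container, String name) throws IOException {
        try (RandomAccessFile in = new RandomAccessFile(container.toFile(), "r")) {
            int count = in.readInt();
            String[] names = new String[count];
            long[] lengths = new long[count];
            for (int i = 0; i < count; i++) {
                names[i] = in.readUTF();
                lengths[i] = in.readLong();
            }
            long offset = in.getFilePointer(); // data starts right after the table
            for (int i = 0; i < count; i++) {
                if (names[i].equals(name)) {
                    byte[] data = new byte[(int) lengths[i]];
                    in.seek(offset);
                    in.readFully(data);
                    return data;
                }
                offset += lengths[i];
            }
            throw new FileNotFoundException(name);
        }
    }
}
```

The container is written once and only read afterwards, so a reader needs one open handle per segment no matter how many sub-files the segment holds.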

The get/setUseCompoundFiles setting can be changed at any time during 
the existence of the IndexWriter and takes effect the next time the 
IndexWriter merges segments in its target directory. 
SegmentIndexReader can now work with either type of segment.

This change does not affect how the segments are handled in the 
temporary RAMDirectory used by the IndexWriter internally, only the 
final segments written to the target directory. Also, a given directory 
can contain both types of segments and everything works out automagically.
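
For a directory that mixes both kinds of segments, one simple way a reader could tell them apart is to check whether <id>.cfs exists. The message does not say how the patch decides this, so the check below, and the names SegmentFormatSketch, Format and detect, are assumptions for illustration rather than the patch's code.

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration only; the patch may detect the format differently.
class SegmentFormatSketch {

    enum Format { COMPOUND, MULTI_FILE }

    /** Treats a segment as compound if "<id>.cfs" is present in the index directory. */
    static Format detect(Path indexDir, String segmentId) {
        return Files.exists(indexDir.resolve(segmentId + ".cfs"))
                ? Format.COMPOUND
                : Format.MULTI_FILE;
    }
}
```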

-----------------------------------------
I have also created a new JUnit test case to test these features, which 
runs successfully. For the moment it creates files under the current 
working directory in which JUnit is executed. I also converted some 
of the older tests "XXXTest" into "TestXXX", and made sure they work 
with the old implementation and the new one. These tests do not yet make 
enough assert(...) calls, but they now execute twice, once with the 
multi-file indexes and once with the new compound file indexes, and 
assert that the output is the same. The old files are still there; I 
just added new ones with the inverted names. In one case - 
ThreadSafetyTest.java - I actually made changes to that file because I 
thought this test was too long to run as an automatic test in JUnit. 
Build.xml required a small change to add a class from the src/demo tree 
to the classpath.

----------------------------------------
Doug, I seriously considered keeping everything at the Directory level, 
as you suggested. This would have been preferred, I agree, but I really 
couldn't find a way to reconcile this approach with the other two goals 
I had: (a) keep specific file extension knowledge at the lucene.index.* 
level where it is now, and (b) avoid having to support writes to the 
compound file.

----------------------------------------
I'm attaching the patches against the current Lucene CVS source 
(basically output of "cvs diff -Buw"). The files listed as "?" are new 
files and are also attached.

(BTW, there are currently two failures in the existing JUnit test cases, 
but they occur with or without these patches, as has already been noted 
by Otis, Doug and Eric).

Finally, I should theoretically have commit access to Lucene's CVS, but 
I've never tried using it. If these changes seem ok, I could commit 
them myself (provided I can find my password, etc.).

Enjoy.
Dmitry.

