lucene-dev mailing list archives

From "Ivaylo Zlatev" <IZla...@entigen.com>
Subject RE: new version of IndexWriter.java
Date Wed, 27 Feb 2002 20:29:17 GMT

My benchmarks show that my IndexWriter2.java performs better than the
original IndexWriter.java and, very importantly, conserves file system
handles.

Here are the results of indexing one and the same set of 11800 records
on a modest SunBlade machine (1 CPU, 450 MHz, decent IDE hard drive)
running Solaris 8.
My "ulimit -n" (i.e. the number of available file handles) is set to
1000 in all tests.
I am using JDK 1.3.1 with an upper memory limit of 512 MB.

IndexWriter with mergeFactor=10 : 93 seconds
IndexWriter with mergeFactor=20 : 63 seconds
IndexWriter with mergeFactor=50 : 50 seconds
IndexWriter with mergeFactor=60 : 48 seconds
IndexWriter with mergeFactor=70 : 48 seconds
IndexWriter with mergeFactor=80 : "too many open files" exception when
                                  adding the ~6300th Document
IndexWriter with mergeFactor=100: "too many open files" exception when
                                  adding the ~9900th Document

IndexWriter2 with mergeFactor=10 in every run (used only at the end of
indexing, during optimize()):

IndexWriter2 with maxDocsInRam=100   : 58 seconds
IndexWriter2 with maxDocsInRam=300   : 45 seconds
IndexWriter2 with maxDocsInRam=500   : 42 seconds
IndexWriter2 with maxDocsInRam=700   : 42 seconds
IndexWriter2 with maxDocsInRam=1000  : 43 seconds
IndexWriter2 with maxDocsInRam=2000  : 46 seconds
IndexWriter2 with maxDocsInRam=5000  : 46 seconds
IndexWriter2 with maxDocsInRam=10000 : 50 seconds
IndexWriter2 with maxDocsInRam=20000 : 56 seconds

As you can see, IndexWriter2 with default settings (mergeFactor=10,
maxDocsInRam=2000) is about twice as fast as IndexWriter with default
settings (mergeFactor=10): 46 seconds compared to 93 seconds.
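
To make the segment counts concrete, here is a minimal sketch (it is not
code from IndexWriter2.java) that models the behaviour described in the
javadoc quoted further down. SegmentMath and its method names are
hypothetical. It assumes that a final partial batch is also flushed as
its own segment, and that optimize() merges at most mergeFactor+1
segments per step until one segment is left.

```java
/** Hypothetical helper: models segment counts only, it does no indexing. */
final class SegmentMath {

    /**
     * Number of segments written to the target directory.
     * Assumption: a trailing partial batch is flushed as one more segment.
     */
    static int flushedSegments(int numDocs, int maxDocsInRam) {
        if (numDocs < 0 || maxDocsInRam < 1) {
            throw new IllegalArgumentException("numDocs >= 0 and maxDocsInRam >= 1 required");
        }
        return (numDocs + maxDocsInRam - 1) / maxDocsInRam; // ceiling division
    }

    /**
     * Number of merge steps needed to reach a single segment.
     * Assumption: each step merges up to mergeFactor+1 segments into one.
     */
    static int optimizeMerges(int segments, int mergeFactor) {
        if (segments < 0 || mergeFactor < 1) {
            throw new IllegalArgumentException("segments >= 0 and mergeFactor >= 1 required");
        }
        int merges = 0;
        while (segments > 1) {
            int merged = Math.min(segments, mergeFactor + 1);
            segments = segments - merged + 1; // merged inputs become one output
            merges++;
        }
        return merges;
    }

    public static void main(String[] args) {
        int numDocs = 11800;   // record count used in the benchmark
        int mergeFactor = 10;  // value used in every IndexWriter2 run
        for (int maxDocsInRam : new int[] {100, 500, 2000, 10000}) {
            int segments = flushedSegments(numDocs, maxDocsInRam);
            System.out.println("maxDocsInRam=" + maxDocsInRam
                    + " segments=" + segments
                    + " merges=" + optimizeMerges(segments, mergeFactor));
        }
    }
}
```

Under those assumptions, 11800 records with maxDocsInRam=2000 end up as
6 segments in the target directory, which a mergeFactor of 10 combines
in a single merge step.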

You may ask why the indexing time in IndexWriter2 increases when we
raise maxDocsInRam above 1000 records.
I assume that the OS maintains a write buffer in memory, which can
hold up to 5000 of my records. When maxDocsInRam is large, IndexWriter2
transfers a big segment (of 10000 records, for example) to the file
system, which fills up that buffer. On Windows these results will
probably look quite different. If anyone is interested, I can run the
tests on Windows 2000.


Now about the PriorityQueue object: the
org.apache.lucene.util.PriorityQueue uses a cool partial
ordering of its elements, which only a sick genius could invent. I have
looked at other PriorityQueue
objects and they look as plain as you can imagine (I don't know about
the one in Jakarta's Commons Collections, though).
The org.apache.lucene.util.PriorityQueue looks like it was quickly
ported from another language - probably C - but
was not polished enough. For example, the internal array that the queue
uses is twice as big as necessary, which is a big waste of memory.
Anyway, the new PriorityQueue addresses all of these issues, but someone
has to incorporate it into Lucene.
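
For readers who have not seen this kind of partial ordering, here is a
minimal sketch of an array-backed binary heap whose backing array has
exactly capacity+1 slots. It is neither the Lucene class nor my
replacement for it: IntMinHeap and its method names are hypothetical,
and it stores plain ints to stay short.

```java
import java.util.NoSuchElementException;

/** Hypothetical illustration of a bounded min-heap; not Lucene code. */
final class IntMinHeap {
    private final int[] heap; // slot 0 is unused, elements live in 1..size
    private int size;

    IntMinHeap(int capacity) {
        if (capacity < 0) {
            throw new IllegalArgumentException("capacity must be >= 0");
        }
        heap = new int[capacity + 1]; // one spare slot, not double the capacity
    }

    int size() {
        return size;
    }

    int allocatedSlots() {
        return heap.length;
    }

    /** Adds a value and restores the heap order by sifting it up. */
    void put(int value) {
        if (size == heap.length - 1) {
            throw new IllegalStateException("heap is full");
        }
        heap[++size] = value;
        upHeap();
    }

    /** Removes and returns the smallest value. */
    int pop() {
        if (size == 0) {
            throw new NoSuchElementException("heap is empty");
        }
        int top = heap[1];
        heap[1] = heap[size--]; // move the last element to the root
        downHeap();
        return top;
    }

    // Only parent/child pairs are ordered; siblings are not compared.
    private void upHeap() {
        int i = size;
        int node = heap[i];
        int parent = i >>> 1;
        while (parent > 0 && node < heap[parent]) {
            heap[i] = heap[parent]; // shift the larger parent down
            i = parent;
            parent = i >>> 1;
        }
        heap[i] = node;
    }

    private void downHeap() {
        if (size == 0) {
            return;
        }
        int i = 1;
        int node = heap[i];
        int child = 2;
        while (child <= size) {
            if (child < size && heap[child + 1] < heap[child]) {
                child++; // pick the smaller of the two children
            }
            if (heap[child] >= node) {
                break;
            }
            heap[i] = heap[child]; // shift the smaller child up
            i = child;
            child = i << 1;
        }
        heap[i] = node;
    }
}
```

The elements are only partially ordered (each parent is no larger than
its children), which is enough to keep put() and pop() logarithmic.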

Regards, Ivaylo




-----Original Message-----
From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com]
Sent: Tuesday, February 26, 2002 11:53 AM
To: Lucene Developers List
Subject: Re: new version of IndexWriter.java


Ivaylo,

Thanks for the contribution.  It sounds good, although I haven't looked
at it yet.  Do you have any performance numbers?  I'm curious how it
compares to the original IndexWriter.

As for your PriorityQueue, it's still sitting flagged in my Lucene
folder for review.
I've been meaning to send a reply with the following question, not just
for you, but for Doug and others as well:
Is there anything special, anything Lucene-specific in that
PriorityQueue?  If not, there is a PriorityQueue implementation in
Jakarta's Commons Collections sub-project which we could (re)use
instead of having our own.  On the other hand, this requires that we
include the collections jar in lib.

Just some thoughts.
In any case, sorry for not replying, the contribution _is_ appreciated.

Otis


--- Ivaylo Zlatev <IZlatev@entigen.com> wrote:
> 
> Yesterday I was inspired by the conversation on the dev. list about
> indexing in memory, etc
> and I wrote a new version of IndexWriter.java (it is named
> IndexWriter2.java). Find the attached file here. The code is stable
> and worth a try. The following is from the javaDocs for this file:
> 
> /**
>  * IndexWriter2 is a modification of the original IndexWriter, coming
>  * with lucene. It benefits from a RAMDirectory, which IndexWriter has
>  * as well. The original IndexWriter treats the segments in the
>  * RAMDirectory no different from the segments in the target directory,
>  * where the index is being built. For example, it ALWAYS merges
>  * RAMDirectory segments in the target directory. Here, we optimize the
>  * usage of RAMDirectory in the following way:<br>
>  *
>  * When a new Document is added, a new segment for it is created in
>  * RAMDirectory. When the RAMDirectory collects 'maxDocsInRam' (this is
>  * a new important setting, the default is 10000) 1-document segments,
>  * IndexWriter2 will merge them into one 10000-documents segment into
>  * RAMDirectory (here is a difference from IndexWriter). Then it moves
>  * this segment from the RAMDirectory to the target directory (usually
>  * a file system directory). This way, during indexing, IndexWriter2
>  * will be writing segments of equal size (equal to maxDocsInRam) to
>  * the target directory. In other words, during indexing only one
>  * file-system segment is opened and dealt with, which uses just a few
>  * file handles. No more "Too many open files" exceptions.<br>
>  *
>  * After indexing is finished, it is good to call optimize() to merge
>  * all created segments into one. The RAMDirectory is out of the
>  * picture here and is not being used. Here is where we use the
>  * mergeFactor setting: A total of mergeFactor+1 segments will be
>  * merged at once into one new segment. This happens in a loop, until
>  * only 1 segment is left. Here you can get to a "Too many open files"
>  * exception, if your mergeFactor is large. If you set mergeFactor to
>  * 1, it will merge only 2 segments at a time, which will preserve the
>  * file handles, but will be a bit slower than a merge with
>  * mergeFactor=10, for example.<br>
>  *
>  * At the end of mergeSegments() originally there was code where, if a
>  * segment file can't be deleted (because it's currently opened in
>  * Windows), it stores its name in a file, named 'deletable', so that
>  * it can try to delete it later. I believe there was some bug with not
>  * closing the merged segments properly, which was the reason for all
>  * of this. Anyway, now there are no problems with deleting these files
>  * on Windows and therefore the code, reading and writing to the
>  * 'deletable' file is commented out.<br>
>  *
>  * @author Ivaylo Zlatev (ivaylo_zlatev@yahoo.com)
>  */
> 
> 
> Two weeks ago I sent an improved PriorityQueue, fixing important
> memory issues and much more. I just wasted my time - no response at
> all. Hopefully this time my code will be more useful.
> 
> Regards, Ivaylo
>  <<IndexWriter2.java>> 
> 

> ATTACHMENT part 2 application/octet-stream name=IndexWriter2.java
> --
> To unsubscribe, e-mail:  
> <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
> For additional commands, e-mail:
<mailto:lucene-dev-help@jakarta.apache.org>

