lucene-dev mailing list archives

From bugzi...@apache.org
Subject DO NOT REPLY [Bug 28183] New: - [Patch] replace DocumentWriter with InvertedDocument for performance
Date Sat, 03 Apr 2004 22:36:45 GMT
DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG 
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://issues.apache.org/bugzilla/show_bug.cgi?id=28183>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND 
INSERTED IN THE BUG DATABASE.

http://issues.apache.org/bugzilla/show_bug.cgi?id=28183

[Patch] replace DocumentWriter with InvertedDocument for performance

           Summary: [Patch] replace DocumentWriter with InvertedDocument for
                    performance
           Product: Lucene
           Version: unspecified
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: Enhancement
          Priority: Other
         Component: Index
        AssignedTo: lucene-dev@jakarta.apache.org
        ReportedBy: brian-apache@slesinsky.org


I've found a way to improve Lucene's indexing performance by about 45% for my dataset.

Here's how it works. Currently, the indexing process goes like this:

- Use DocumentWriter to create an inverted index and serialize a one-document segment to a
  RAMDirectory.
- When enough documents have been read, deserialize the one-document segments in the
  RAMDirectory and merge them, writing the merged segment to disk.

What I've done instead is create a new class, InvertedDocument, that keeps the inverted index
in a Map and can also be used directly as input for a merge.  This avoids the
serialization/deserialization step, and the RAMDirectory is no longer used when indexing.
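
To illustrate the idea only (this is not the code from the patch), here is a minimal sketch in
plain Java.  The class name InvertedDocumentSketch, the whitespace tokenizer, and the use of
the buffer position as the document number are all assumptions made for the example; the real
patch works with Lucene's own analyzers and segment merger.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical stand-in for the idea described above; not the class from the patch. */
public class InvertedDocumentSketch {
    // term -> positions at which the term occurs within this one document
    private final Map<String, List<Integer>> positions = new TreeMap<>();

    /** Inverts one document. Splitting on whitespace is an assumption for brevity. */
    public static InvertedDocumentSketch invert(String text) {
        InvertedDocumentSketch doc = new InvertedDocumentSketch();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return doc;
        }
        String[] tokens = trimmed.toLowerCase(Locale.ROOT).split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            doc.positions.computeIfAbsent(tokens[pos], t -> new ArrayList<>()).add(pos);
        }
        return doc;
    }

    public Map<String, List<Integer>> positions() {
        return positions;
    }

    /**
     * Merges buffered documents straight from their in-memory maps into
     * term -> document numbers, with no serialize/deserialize round trip.
     * The document number is simply the index in the buffer (an assumption).
     */
    public static Map<String, List<Integer>> merge(List<InvertedDocumentSketch> docs) {
        Map<String, List<Integer>> postings = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            for (String term : docs.get(docId).positions.keySet()) {
                postings.computeIfAbsent(term, t -> new ArrayList<>()).add(docId);
            }
        }
        return postings;
    }

    public static void main(String[] args) {
        List<InvertedDocumentSketch> buffer = new ArrayList<>();
        buffer.add(invert("lucene indexes text"));
        buffer.add(invert("text search with lucene"));
        // A real indexer would write the merged postings to disk here.
        System.out.println(merge(buffer));
    }
}
```

The point of the sketch is that the merge step reads the per-document maps directly, so nothing
has to be written to and read back from an intermediate in-memory directory.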

The patch applies to the contents of CVS as of today (April 3).  (It's a big patch and includes
some minor style tweaks that aren't directly related.)

I did the performance testing using a simple application that creates an index from a file
containing messages extracted from a bulletin board.  It could index about 100 kilobytes/second
with Lucene 1.3, and 145 kilobytes/second with the patch.  This is on a 700 MHz eMac, which is
pretty slow at Java, and the documents being indexed are, on average, less than a screenful.

