lucene-dev mailing list archives

From bugzi...@apache.org
Subject DO NOT REPLY [Bug 27326] New: - [PATCH] minor performance enhancements for DocumentWriter.invertDocument()
Date Mon, 01 Mar 2004 07:10:10 GMT
http://nagoya.apache.org/bugzilla/show_bug.cgi?id=27326

           Summary: [PATCH] minor performance enhancements for
                    DocumentWriter.invertDocument()
           Product: Lucene
           Version: unspecified
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: Enhancement
          Priority: Other
         Component: Index
        AssignedTo: lucene-dev@jakarta.apache.org
        ReportedBy: brian-apache@slesinsky.org


This patch includes two small performance improvements:

1. Switch from Hashtable to HashMap and preset the capacity to avoid
   resizing the HashMap (barely measurable improvement, but easy).

2. Add a new Analyzer.tokenStream() method that takes a String instead of
   a Reader, and call this from within DocumentWriter.invertDocument().
   This allows subclasses of Analyzer to provide a more efficient
   tokenizer for Strings.  (The default implementation just uses a
   StringReader.)
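
As a rough illustration of the two changes above (not the attached patch
itself), the sketch below uses hypothetical stand-in classes rather than
Lucene's real Analyzer and DocumentWriter. SketchAnalyzer, tokenize(), and
PostingsSketch are invented names, tokens are returned as a plain List
instead of a TokenStream, and the code is written in modern Java for
brevity. It shows a String overload whose default simply wraps the value in
a StringReader, and a HashMap whose initial capacity is preset from the
expected number of entries.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an analyzer base class; NOT Lucene's Analyzer.
abstract class SketchAnalyzer {
    // Reader-based entry point, analogous to the existing tokenStream(Reader) style.
    abstract List<String> tokenize(String fieldName, Reader reader);

    // String-based overload, analogous to the proposed method.
    // The default just delegates through a StringReader; subclasses may
    // override it with something cheaper for in-memory Strings.
    List<String> tokenize(String fieldName, String text) {
        return tokenize(fieldName, new StringReader(text));
    }
}

// Simple whitespace splitter that only implements the Reader path.
class WhitespaceSketchAnalyzer extends SketchAnalyzer {
    @Override
    List<String> tokenize(String fieldName, Reader reader) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        try {
            int c;
            while ((c = reader.read()) != -1) {
                if (Character.isWhitespace(c)) {
                    if (current.length() > 0) {
                        out.add(current.toString());
                        current.setLength(0);
                    }
                } else {
                    current.append((char) c);
                }
            }
        } catch (IOException e) {
            // Wrapped only to keep this sketch free of checked exceptions.
            throw new UncheckedIOException(e);
        }
        if (current.length() > 0) {
            out.add(current.toString());
        }
        return out;
    }
}

// Hypothetical stand-in for the per-document term table.
final class PostingsSketch {
    static Map<String, Integer> countTerms(List<String> tokens) {
        // Preset the capacity so that tokens.size() entries fit under the
        // default 0.75 load factor without triggering a resize.
        int capacity = (int) (tokens.size() / 0.75f) + 1;
        Map<String, Integer> counts = new HashMap<>(capacity);
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }
}
```

A caller holding a field value as a String would invoke
tokenize(fieldName, text) directly, so an analyzer that overrides the String
overload never has to go through a Reader at all.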

I was able to write a variant on LowercaseAnalyzer (not included) that's
about 10% faster for my dataset.  It works by converting the entire field
value with String.toLowerCase() and then using String.substring() to
extract the string for each token.  This avoids allocating individual
char[] arrays inside String for each token, because String.substring()
shares its char[] array with the original.
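
The analyzer variant described above was not included with the report, so
the following is only a guess at its general shape. It lowercases the whole
value once and then cuts tokens out with substring(). The class name
SubstringLowercaseTokenizer, the choice to split at non-letter characters,
and the use of Locale.ROOT are all assumptions made for this sketch. Note
also that the char[] sharing applies to JVMs of that era: since JDK 7
update 6, String.substring() copies the characters, so the allocation
saving described here would not carry over to current runtimes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch; not the reporter's actual analyzer.
final class SubstringLowercaseTokenizer {
    static List<String> tokenize(String text) {
        // Lowercase the entire field value once, up front.
        String lower = text.toLowerCase(Locale.ROOT);
        List<String> tokens = new ArrayList<>();
        int start = -1; // index where the current token began, or -1 if none
        for (int i = 0; i < lower.length(); i++) {
            if (Character.isLetter(lower.charAt(i))) {
                if (start < 0) {
                    start = i;
                }
            } else if (start >= 0) {
                // Token ended: extract it directly from the lowercased value.
                tokens.add(lower.substring(start, i));
                start = -1;
            }
        }
        if (start >= 0) {
            // Flush a token that runs to the end of the input.
            tokens.add(lower.substring(start));
        }
        return tokens;
    }
}
```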


