lucene-dev mailing list archives

From "Michael Busch (JIRA)"
Subject [jira] Created: (LUCENE-629) Performance improvement for merging stored, compressed fields
Date Mon, 17 Jul 2006 21:49:14 GMT
Performance improvement for merging stored, compressed fields

                 Key: LUCENE-629
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Index
            Reporter: Michael Busch
            Priority: Minor

Hello everyone,

Currently the merging of stored, compressed fields is not optimal for the following reason:
every time a stored, compressed field is merged, the FieldsReader uncompresses the data,
and the FieldsWriter then has to compress it again when it writes the merged fields data (.fdt)
file. This uncompress/compress round trip is unnecessary and slows down merging significantly.

This patch improves merge performance by avoiding the uncompress/compress step. Here is an
overview of the changes I made:
   * Added a new FieldSelectorResult constant named "LOAD_FOR_MERGE" to org.apache.lucene.document.FieldSelectorResult
   * SegmentMerger now uses a FieldSelector to get stored fields from the FieldsReader. This
FieldSelector's accept() method returns the FieldSelectorResult "LOAD_FOR_MERGE" for every
   * Added a new inner class to FieldsReader named "FieldForMerge", which extends org.apache.lucene.document.AbstractField.
This class holds the field properties and its data. If a field has the FieldSelectorResult
"LOAD_FOR_MERGE", then the FieldsReader creates an instance of "FieldForMerge" and does not
uncompress the field's data.
   * FieldsWriter checks whether the field it is about to write is an instance of FieldsReader.FieldForMerge.
If so, it writes the field data as-is and does not compress it again.
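
The idea behind these changes can be sketched in a few lines of plain Java. This is only an illustration, not the patch itself: the class MergePassThroughSketch and the names StoredField, PlainField, PassThroughField, read() and write() are hypothetical stand-ins for the roles played by FieldsReader, FieldsWriter and FieldForMerge, and java.util.zip is assumed as the compression codec.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

final class MergePassThroughSketch {

    /** Hypothetical stand-in for a stored field handed from reader to writer. */
    interface StoredField {
        byte[] bytes();
    }

    /** Normal load: holds the uncompressed value. */
    static final class PlainField implements StoredField {
        private final byte[] raw;
        PlainField(byte[] raw) { this.raw = raw; }
        public byte[] bytes() { return raw; }
    }

    /** Merge load: holds the bytes exactly as stored, still compressed. */
    static final class PassThroughField implements StoredField {
        private final byte[] compressed;
        PassThroughField(byte[] compressed) { this.compressed = compressed; }
        public byte[] bytes() { return compressed; }
    }

    static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    static byte[] uncompress(byte[] input) {
        Inflater inflater = new Inflater();
        inflater.setInput(input);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && inflater.needsInput()) {
                    throw new IllegalArgumentException("truncated compressed data");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt compressed data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    /** Reader side: skip decompression when the field is loaded for merging. */
    static StoredField read(byte[] storedBytes, boolean loadForMerge) {
        return loadForMerge
                ? new PassThroughField(storedBytes)
                : new PlainField(uncompress(storedBytes));
    }

    /** Writer side: copy pass-through fields as-is, compress everything else. */
    static byte[] write(StoredField field) {
        if (field instanceof PassThroughField) {
            return field.bytes();
        }
        return compress(field.bytes());
    }

    public static void main(String[] args) {
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < 100; i++) {
            text.append("some raw text ");
        }
        byte[] stored = compress(text.toString().getBytes(StandardCharsets.UTF_8));

        byte[] oldPath = write(read(stored, false)); // uncompress, then compress again
        byte[] newPath = write(read(stored, true));  // bytes copied untouched

        System.out.println("identical output: " + Arrays.equals(oldPath, newPath));
    }
}
```

Both paths produce the same bytes for the merged segment, but the second one never touches the codec, which is where the merge time is saved.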

To test the performance, I indexed about 350,000 text files, storing the raw text in a stored,
compressed field in the Lucene index, with a merge factor of 10. The final index has a size
of 366 MB. After building the index, I optimized it to measure the pure merge performance.

Here are the performance results:

old version:
   * Time for Indexing:  36.7 minutes
   * Time for Optimizing: 4.6 minutes

patched version:
   * Time for Indexing:  20.8 minutes
   * Time for Optimizing: 0.5 minutes

The results show that the index build time dropped by about 43% (36.7 to 20.8 minutes), and
the optimizing step is roughly 9x faster (4.6 to 0.5 minutes).
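
For reference, the two ratios follow directly from the timings listed above. A minimal sketch of the arithmetic, where the class and method names are made up for illustration:

```java
import java.util.Locale;

final class MergeTimings {

    /** Relative reduction in elapsed time, in percent. */
    static double percentReduction(double before, double after) {
        return (before - after) / before * 100.0;
    }

    /** How many times faster the new run is. */
    static double speedup(double before, double after) {
        return before / after;
    }

    public static void main(String[] args) {
        // Timings in minutes, taken from the results above.
        System.out.printf(Locale.ROOT, "indexing time reduced by %.1f%%%n",
                percentReduction(36.7, 20.8));
        System.out.printf(Locale.ROOT, "optimize speedup: %.1fx%n",
                speedup(4.6, 0.5));
    }
}
```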

A diff of the final indexes (old and patched version) shows that they are identical. Furthermore,
all JUnit test cases succeeded with the patched version.

  Michael Busch
