lucene-dev mailing list archives

From "Michael Busch (JIRA)" <>
Subject [jira] Updated: (LUCENE-624) Segment size limit for compound files
Date Wed, 26 Jul 2006 22:18:16 GMT

Michael Busch updated LUCENE-624:

    Attachment: cfs_seg_size_limit.patch

I have attached the patch file for this improvement.

This patch adds two new methods to the API of IndexWriter and IndexModifier:
  /** Gets the current value of the compound file segment size limit.
   *  Note that this just returns the value you set with setCompoundFileSegmentSizeLimit(int)
   *  or the default. You cannot use this to query the status of an existing index.
   *  @see #setCompoundFileSegmentSizeLimit(int)
   */
  public int getCompoundFileSegmentSizeLimit();

  /** Sets the maximum number of documents a segment may contain for the
   *  compound format to be used for that segment. A high
   *  limit will decrease the number of files per index, whereas
   *  a lower limit will improve search performance but
   *  increase the number of files.
   */
  public void setCompoundFileSegmentSizeLimit(int value);

Furthermore I added a constant to IndexWriter:

Since the default value is set to Integer.MAX_VALUE, the behavior of IndexWriter/IndexModifier
only changes if the user uses setCompoundFileSegmentSizeLimit(int) to change the value explicitly.
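
As a rough illustration of the getter/setter pair and the default described above, here is a minimal standalone sketch. It is not the code from cfs_seg_size_limit.patch: the class name, the DEFAULT_LIMIT constant name, and the usesCompoundFormat helper are hypothetical, and the "no more documents than the limit" comparison is an assumption taken from the wording of the issue description.

```java
// Hypothetical sketch of the policy described above; NOT the code from
// cfs_seg_size_limit.patch. Class, constant, and helper names are assumptions.
final class CompoundFilePolicySketch {
    // The message states that the default is Integer.MAX_VALUE;
    // the constant name used here is assumed.
    static final int DEFAULT_LIMIT = Integer.MAX_VALUE;

    private int limit = DEFAULT_LIMIT;

    int getCompoundFileSegmentSizeLimit() {
        return limit;
    }

    void setCompoundFileSegmentSizeLimit(int value) {
        limit = value;
    }

    // Assumed rule: a segment uses the compound format when it holds
    // no more documents than the configured limit.
    boolean usesCompoundFormat(int segmentDocCount) {
        return segmentDocCount <= limit;
    }

    public static void main(String[] args) {
        CompoundFilePolicySketch policy = new CompoundFilePolicySketch();
        // With the default limit, every segment stays in compound format.
        System.out.println(policy.usesCompoundFormat(1_000_000));

        // After lowering the limit, only small segments use compound format.
        policy.setCompoundFileSegmentSizeLimit(100_000);
        System.out.println(policy.usesCompoundFormat(1_000_000));
        System.out.println(policy.usesCompoundFormat(1_000));
    }
}
```

With the default left untouched, the sketch treats every segment as a compound-file segment, which matches the statement that existing behavior only changes once the setter is called.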

> Segment size limit for compound files
> -------------------------------------
>                 Key: LUCENE-624
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>            Priority: Minor
>         Attachments: cfs_seg_size_limit.patch
> Hello everyone,
> I implemented an improvement targeting compound file usage. Compound files are used to
> decrease the number of index files, because operating systems limit the number of open file
> descriptors. On the other hand, a disadvantage of the compound file format is its worse
> performance compared to multi-file indexes:
> the book "Lucene in Action" says that the compound file format is about 5-10% slower
> than the multi-file format.
> The patch I'm proposing here adds the ability to the IndexWriter to use compound format
> only for segments that do not contain more documents than a specific limit,
> "CompoundFileSegmentSizeLimit", which the user can set.
> Due to the exponential merges, a Lucene index usually contains only a few very big segments
> but many more small segments. The best performance is really only needed for the big segments,
> whereas slightly worse performance for small segments shouldn't play a big role in the overall
> search performance.
> Consider the following example:
> Index size:               1,500,000
> Merge factor:             10
> Max buffered docs:        100
> Number of indexed fields: 10
> Max. OS file descriptors: 1024
> In the worst case, a non-optimized index could contain the following number of segments:
> 1 x 1,000,000
> 9 x   100,000
> 9 x    10,000
> 9 x     1,000
> 9 x       100
> That's 37 segments. A multi-file format index would have:
> 37 segments * (7 files per segment + 10 files for indexed fields) = 629 files ==>
> only about 2 open indexes per machine could be handled by the operating system.
> A compound-file format index would have:
> 37 segments * 1 cfs file = 37 files ==> about 27 open indexes could be handled by
> the operating system, but performance would be 5-10% worse.
> A compound-file format index with CompoundFileSegmentSizeLimit = 1,000,000 would have:
> 36 segments * 1 cfs file + 1 segment * (7 + 10 files) = 53 files ==> about 20 open indexes
> could be handled by the OS.
> The OS can now handle 20 instead of just 2 open indexes, while maintaining the multi-file
> format performance.
> I'm going to create diffs on the current HEAD and will attach the patch files soon. Please
> let me know what you think about this improvement.
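
The file-count arithmetic in the quoted example can be checked with a short sketch. It only reuses the figures given in the message (segment counts per merge level, 7 files per segment plus 10 for indexed fields, 1024 file descriptors); the class and method names are hypothetical. The number of segments kept in multi-file format is passed in explicitly, because the example treats the single 1,000,000-document segment as the only multi-file segment. Whole-number division yields 1, 27 and 19 open indexes, slightly below the rounded "about 2" and "about 20" quoted in the message.

```java
// Hypothetical helper that reproduces the arithmetic of the quoted example.
// All constants come from the message; the names are assumptions.
final class FileCountExample {
    static final int CORE_FILES_PER_SEGMENT = 7;
    static final int INDEXED_FIELDS = 10;
    static final int MAX_FILE_DESCRIPTORS = 1024;

    // Worst-case segment counts per level, largest first:
    // 1 x 1,000,000, then 9 each of 100,000 / 10,000 / 1,000 / 100 documents.
    static final int[] SEGMENTS_PER_LEVEL = {1, 9, 9, 9, 9};

    static int totalSegments() {
        int total = 0;
        for (int count : SEGMENTS_PER_LEVEL) {
            total += count;
        }
        return total;
    }

    // Every segment stored as separate files.
    static int multiFileCount(int segments) {
        return segments * (CORE_FILES_PER_SEGMENT + INDEXED_FIELDS);
    }

    // Every segment stored as a single .cfs file.
    static int compoundFileCount(int segments) {
        return segments;
    }

    // Some segments stay multi-file, the rest use one .cfs file each.
    static int mixedFileCount(int segments, int multiFileSegments) {
        return (segments - multiFileSegments)
                + multiFileCount(multiFileSegments);
    }

    // Whole indexes that fit within the file descriptor limit.
    static int openIndexes(int filesPerIndex) {
        return MAX_FILE_DESCRIPTORS / filesPerIndex;
    }

    public static void main(String[] args) {
        int segments = totalSegments();
        int multi = multiFileCount(segments);
        int compound = compoundFileCount(segments);
        int mixed = mixedFileCount(segments, 1);
        System.out.println("segments:   " + segments);
        System.out.println("multi-file: " + multi + " files, " + openIndexes(multi) + " open indexes");
        System.out.println("compound:   " + compound + " files, " + openIndexes(compound) + " open indexes");
        System.out.println("mixed:      " + mixed + " files, " + openIndexes(mixed) + " open indexes");
    }
}
```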

This message is automatically generated by JIRA.
