lucene-dev mailing list archives

From Michael Busch <busch...@gmail.com>
Subject Re: [jira] Updated: (LUCENE-624) Segment size limit for compound files
Date Fri, 28 Jul 2006 00:44:52 GMT
robert engels wrote:
> Why do more segment files improve search performance? I can see that 
> if you have many smaller files, the merge process for incremental adds 
> might be faster, but more segments should actually make searching slower.
Robert,

I did not run my own performance experiments, but after reading some 
threads about compound file performance again I think you are right. The 
compound file format does not affect search performance significantly, but 
it slows down indexing by 5-10%. So this tiny patch should improve 
indexing speed while keeping the number of segment files relatively low. 
If I find some time I will run performance experiments to get some numbers.

Michael

> On Jul 26, 2006, at 5:18 PM, Michael Busch (JIRA) wrote:
>
>>      [ http://issues.apache.org/jira/browse/LUCENE-624?page=all ]
>>
>> Michael Busch updated LUCENE-624:
>> ---------------------------------
>>
>>     Attachment: cfs_seg_size_limit.patch
>>
>> I attach the patch file for this improvement.
>>
>> This patch adds two new methods to the API of IndexWriter and 
>> IndexModifier:
>>   /** Get the current value of the compound file segment size limit.
>>    *  Note that this just returns the value you set with 
>> setCompoundFileSegmentSizeLimit(int)
>>    *  or the default. You cannot use this to query the status of an 
>> existing index.
>>    *  @see #setCompoundFileSegmentSizeLimit(int)
>>    */
>>   public int getCompoundFileSegmentSizeLimit();
>>
>>   /** Sets the limit of documents a segment can have, so that
>>    *  compound format is being used for that segment. A high
>>    *  limit will decrease the number of files per index, whereas
>>    *  a lower limit will improve search performance but
>>    *  increase the number of files.
>>    */
>>   public void setCompoundFileSegmentSizeLimit(int value);
>>
>> Furthermore I added a constant to IndexWriter:
>> public final static int DEFAULT_COMPOUND_FILE_SEGMENT_SIZE_LIMIT = 
>> Integer.MAX_VALUE;
>>
>> Since the default value is set to Integer.MAX_VALUE, the behavior of 
>> IndexWriter/IndexModifier only changes if the user uses 
>> setCompoundFileSegmentSizeLimit(int) to change the value explicitly.
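The intended behavior of the two accessors and the default can be sketched in isolation. This is a minimal illustration, not the patch itself: the class name CompoundLimitSketch and the helper useCompoundFormat(int) are hypothetical, and the "document count <= limit" comparison is an assumption based on the issue description's wording ("segments that do not contain more documents than a specific limit").

```java
/** Hypothetical stand-in for the writer-side setting; not Lucene code. */
public class CompoundLimitSketch {
    // Same default as the constant proposed in the patch description.
    public static final int DEFAULT_COMPOUND_FILE_SEGMENT_SIZE_LIMIT = Integer.MAX_VALUE;

    private int limit = DEFAULT_COMPOUND_FILE_SEGMENT_SIZE_LIMIT;

    public int getCompoundFileSegmentSizeLimit() {
        return limit;
    }

    public void setCompoundFileSegmentSizeLimit(int value) {
        limit = value;
    }

    /**
     * Assumed decision rule: a segment is written in compound format
     * only if it does not contain more documents than the limit.
     */
    public boolean useCompoundFormat(int segmentDocCount) {
        return segmentDocCount <= limit;
    }

    public static void main(String[] args) {
        CompoundLimitSketch sketch = new CompoundLimitSketch();
        // With the default limit every segment stays in compound format.
        System.out.println(sketch.useCompoundFormat(1_000_000));
        // After lowering the limit, only small segments use compound format.
        sketch.setCompoundFileSegmentSizeLimit(100_000);
        System.out.println(sketch.useCompoundFormat(1_000_000));
        System.out.println(sketch.useCompoundFormat(100));
    }
}
```

With the default of Integer.MAX_VALUE the rule is true for every possible document count, which matches the statement that behavior only changes once the limit is set explicitly.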
>>
>>> Segment size limit for compound files
>>> -------------------------------------
>>>
>>>                 Key: LUCENE-624
>>>                 URL: http://issues.apache.org/jira/browse/LUCENE-624
>>>             Project: Lucene - Java
>>>          Issue Type: Improvement
>>>          Components: Index
>>>            Reporter: Michael Busch
>>>            Priority: Minor
>>>         Attachments: cfs_seg_size_limit.patch
>>>
>>>
>>> Hello everyone,
>>> I implemented an improvement targeting compound file usage. Compound 
>>> files are used to decrease the number of index files, because 
>>> operating systems can't handle too many open file descriptors. On 
>>> the other hand, a disadvantage of compound file format is the worse 
>>> performance compared to multi-file indexes:
>>> http://www.gossamer-threads.com/lists/lucene/java-user/8950
>>> In the book "Lucene in Action" it's said that compound file format 
>>> is about 5-10% slower than multi-file format.
>>> The patch I'm proposing here adds the ability to the IndexWriter to 
>>> use compound format only for segments that do not contain more 
>>> documents than a specific limit "CompoundFileSegmentSizeLimit", 
>>> which the user can set.
>>> Due to the exponential merges, a Lucene index usually contains only 
>>> a few very big segments, but many more small segments. The best 
>>> performance is actually just needed for the big segments, whereas a 
>>> slightly worse performance for small segments shouldn't play a big 
>>> role in the overall search performance.
>>> Consider the following example:
>>> Index Size:                            1,500,000
>>> Merge factor:                        10
>>> Max buffered docs:             100
>>> Number of indexed fields: 10
>>> Max. OS file descriptors:    1024
>>> In the worst case, a non-optimized index could contain the following 
>>> number of segments:
>>> 1 x 1,000,000
>>> 9 x   100,000
>>> 9 x    10,000
>>> 9 x     1,000
>>> 9 x       100
>>> That's 37 segments. A multi-file format index would have:
>>> 37 segments * (7 files per segment + 10 files for indexed fields) = 
>>> 629 files ==> only about 2 open indexes per machine could be handled 
>>> by the operating system
>>> A compound-file format index would have:
>>> 37 segments * 1 cfs file = 37 files ==> about 27 open indexes could 
>>> be handled by the operating system, but performance would be 5-10% 
>>> worse.
>>> A compound-file format index with CompoundFileSegmentSizeLimit = 
>>> 1,000,000 would have:
>>> 36 segments * 1 cfs file + 1 segment * (7 + 10 files) = 53 ==> about 
>>> 20 open indexes could be handled by the OS
>>> The OS can now handle 20 instead of just 2 open indexes, while 
>>> maintaining the multi-file format performance.
>>> I'm going to create diffs on the current HEAD and will attach the 
>>> patch files soon. Please let me know what you think about this 
>>> improvement.
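The file-count arithmetic in the example above can be reproduced with a short sketch. It assumes the figures given in the issue description (7 fixed files per segment, one extra file per indexed field, one .cfs file per compound segment, 1024 file descriptors); the class and method names are hypothetical and only serve the illustration.

```java
/** Reproduces the file-count arithmetic from the issue description. */
public class CfsFileCountSketch {
    // Fixed per-segment file count taken from the example, not from Lucene itself.
    static final int FIXED_FILES_PER_SEGMENT = 7;

    /** Files for segments stored in multi-file format. */
    static int multiFileCount(int segments, int indexedFields) {
        return segments * (FIXED_FILES_PER_SEGMENT + indexedFields);
    }

    /** Files for segments stored in compound format: one .cfs file each. */
    static int compoundFileCount(int segments) {
        return segments;
    }

    /** Files when only some segments are kept in multi-file format. */
    static int mixedFileCount(int compoundSegments, int multiFileSegments, int indexedFields) {
        return compoundFileCount(compoundSegments) + multiFileCount(multiFileSegments, indexedFields);
    }

    /** How many such indexes fit under the descriptor limit (not rounded). */
    static double openIndexes(int maxDescriptors, int filesPerIndex) {
        return (double) maxDescriptors / filesPerIndex;
    }

    public static void main(String[] args) {
        int segments = 1 + 9 * 4;   // 1 x 1,000,000 plus 9 each of 100,000 .. 100
        int fields = 10;
        int descriptors = 1024;

        int multi = multiFileCount(segments, fields);
        int compound = compoundFileCount(segments);
        int mixed = mixedFileCount(segments - 1, 1, fields);

        System.out.printf("multi-file: %d files, %.1f indexes%n", multi, openIndexes(descriptors, multi));
        System.out.printf("compound:   %d files, %.1f indexes%n", compound, openIndexes(descriptors, compound));
        System.out.printf("mixed:      %d files, %.1f indexes%n", mixed, openIndexes(descriptors, mixed));
    }
}
```

The unrounded ratios are roughly 1.6, 27.7 and 19.3 open indexes, so the "about 2", "about 27" and "about 20" figures in the description are approximations of these values.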
>>
>> --This message is automatically generated by JIRA.
>> -
>> If you think it was sent incorrectly contact one of the 
>> administrators: http://issues.apache.org/jira/secure/Administrators.jspa
>> -
>> For more information on JIRA, see: 
>> http://www.atlassian.com/software/jira
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>