lucene-dev mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: [jira] Updated: (LUCENE-624) Segment size limit for compound files
Date Fri, 28 Jul 2006 00:55:59 GMT
Probably not during indexing, which is what Michael was referring to in his last email, if
I understood him correctly.
I suppose indexing with the compound format would be a bit slower because the individual index
files have to be combined into a .cfs file, and that takes a bit of extra time.

Otis

----- Original Message ----
From: robert engels <rengels@ix.netcom.com>
To: java-dev@lucene.apache.org
Sent: Thursday, July 27, 2006 8:48:53 PM
Subject: Re: [jira] Updated: (LUCENE-624) Segment size limit for compound files

In my experience, the more segment files the worse the performance  
(thus the optimize method).

On Jul 27, 2006, at 7:44 PM, Michael Busch wrote:

> robert engels wrote:
>> Why do more segment files improve search performance? I can see
>> that if you have many smaller files, the merge process for  
>> incremental adds might be faster, but more segments should  
>> actually make searching slower.
> Robert,
>
> I did not run my own performance experiments, but after reading
> some threads about compound file performance again I think you are
> right. The compound file format does not affect search performance
> significantly, but it slows down indexing by 5-10%. So this
> tiny patch should improve indexing speed while keeping the number  
> of segment files relatively low. If I find some time I will run  
> performance experiments to get some numbers.
>
> Michael
>
>> On Jul 26, 2006, at 5:18 PM, Michael Busch (JIRA) wrote:
>>
>>>      [ http://issues.apache.org/jira/browse/LUCENE-624?page=all ]
>>>
>>> Michael Busch updated LUCENE-624:
>>> ---------------------------------
>>>
>>>     Attachment: cfs_seg_size_limit.patch
>>>
>>> I attach the patch file for this improvement.
>>>
>>> This patch adds two new methods to the API of IndexWriter and  
>>> IndexModifier:
>>>   /** Get the current value of the compound file segment size limit.
>>>    *  Note that this just returns the value you set with
>>>    *  setCompoundFileSegmentSizeLimit(int) or the default. You cannot
>>>    *  use this to query the status of an existing index.
>>>    *  @see #setCompoundFileSegmentSizeLimit(int)
>>>    */
>>>   public int getCompoundFileSegmentSizeLimit();
>>>
>>>   /** Sets the maximum number of documents a segment may contain
>>>    *  for the compound format to be used for that segment. A high
>>>    *  limit will decrease the number of files per index, whereas
>>>    *  a lower limit will improve search performance but
>>>    *  increase the number of files.
>>>    */
>>>   public void setCompoundFileSegmentSizeLimit(int value);
>>>
>>> Furthermore I added a constant to IndexWriter:
>>> public final static int DEFAULT_COMPOUND_FILE_SEGMENT_SIZE_LIMIT = Integer.MAX_VALUE;
>>>
>>> Since the default value is set to Integer.MAX_VALUE, the behavior  
>>> of IndexWriter/IndexModifier only changes if the user uses  
>>> setCompoundFileSegmentSizeLimit(int) to change the value explicitly.
>>>
>>>> Segment size limit for compound files
>>>> -------------------------------------
>>>>
>>>>                 Key: LUCENE-624
>>>>                 URL: http://issues.apache.org/jira/browse/LUCENE-624
>>>>             Project: Lucene - Java
>>>>          Issue Type: Improvement
>>>>          Components: Index
>>>>            Reporter: Michael Busch
>>>>            Priority: Minor
>>>>         Attachments: cfs_seg_size_limit.patch
>>>>
>>>>
>>>> Hello everyone,
>>>> I implemented an improvement targeting compound file usage.
>>>> Compound files are used to decrease the number of index files,
>>>> because operating systems limit the number of open file
>>>> descriptors. On the other hand, a disadvantage of the compound file
>>>> format is its worse performance compared to multi-file indexes:
>>>> http://www.gossamer-threads.com/lists/lucene/java-user/8950
>>>> The book "Lucene in Action" says that the compound file
>>>> format is about 5-10% slower than the multi-file format.
>>>> The patch I'm proposing here lets the IndexWriter use the
>>>> compound format only for segments that do not contain
>>>> more documents than a specific limit,
>>>> "CompoundFileSegmentSizeLimit", which the user can set.
>>>> Due to the exponential merges, a Lucene index usually contains
>>>> only a few very big segments but many more small segments. The
>>>> best performance is really only needed for the big segments,
>>>> whereas slightly worse performance for small segments shouldn't
>>>> play a big role in the overall search performance.
>>>> Consider the following example:
>>>> Index size:               1,500,000
>>>> Merge factor:             10
>>>> Max buffered docs:        100
>>>> Number of indexed fields: 10
>>>> Max. OS file descriptors: 1024
>>>> In the worst case, a non-optimized index could contain the
>>>> following segments:
>>>> 1 x 1,000,000
>>>> 9 x   100,000
>>>> 9 x    10,000
>>>> 9 x     1,000
>>>> 9 x       100
>>>> That's 37 segments. A multi-file format index would have:
>>>> 37 segments * (7 files per segment + 10 files for indexed  
>>>> fields) = 629 files ==> only about 2 open indexes per machine  
>>>> could be handled by the operating system
>>>> A compound-file format index would have:
>>>> 37 segments * 1 cfs file = 37 files ==> about 27 open indexes  
>>>> could be handled by the operating system, but performance would  
>>>> be 5-10% worse.
>>>> A compound-file format index with CompoundFileSegmentSizeLimit =  
>>>> 1,000,000 would have:
>>>> 36 segments * 1 cfs file + 1 segment * (7 + 10 files) = 53 ==>  
>>>> about 20 open indexes could be handled by the OS
>>>> The OS can now handle 20 instead of just 2 open indexes, while
>>>> keeping multi-file format performance for the big segment.
>>>> I'm going to create diffs on the current HEAD and will attach  
>>>> the patch files soon. Please let me know what you think about  
>>>> this improvement.
>>>
>>> --This message is automatically generated by JIRA.
>>> -
>>> If you think it was sent incorrectly contact one of the administrators:
>>> http://issues.apache.org/jira/secure/Administrators.jspa
>>> -
>>> For more information on JIRA, see: http://www.atlassian.com/software/jira
>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>
>
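
For anyone who wants to check the arithmetic in the worst-case example quoted above, here is a minimal Java sketch. It is not part of the patch and does not use Lucene; the class and method names (CfsFileCountSketch, worstCaseSegments, countFiles) are made up for illustration. It assumes 7 + 10 = 17 files per multi-file segment and 1 file per compound segment, as in the example. One more assumption: to reproduce the quoted figure of 53 files with a limit of 1,000,000, a segment is treated as compound only when its document count is strictly below the limit, although the description says "not more than". How the patch itself compares is not shown in this thread.

```java
import java.util.ArrayList;
import java.util.List;

/** Back-of-envelope file counts for the worst-case example in LUCENE-624 (illustrative only). */
final class CfsFileCountSketch {

    /** One large segment plus (mergeFactor - 1) segments at each smaller merge level. */
    static List<Integer> worstCaseSegments(int largest, int mergeFactor, int maxBufferedDocs) {
        List<Integer> sizes = new ArrayList<>();
        sizes.add(largest);
        for (int size = largest / mergeFactor; size >= maxBufferedDocs; size /= mergeFactor) {
            for (int i = 0; i < mergeFactor - 1; i++) {
                sizes.add(size);
            }
        }
        return sizes;
    }

    /**
     * Counts index files: one .cfs file for each segment below cfsLimit,
     * filesPerMultiFileSegment files for every other segment.
     * The strict "<" comparison is an assumption chosen to match the quoted example.
     */
    static int countFiles(List<Integer> segmentSizes, int cfsLimit, int filesPerMultiFileSegment) {
        int files = 0;
        for (int docs : segmentSizes) {
            files += (docs < cfsLimit) ? 1 : filesPerMultiFileSegment;
        }
        return files;
    }

    public static void main(String[] args) {
        int maxFileDescriptors = 1024;
        int filesPerMultiFileSegment = 7 + 10; // 7 base files + 10 indexed fields, per the example
        List<Integer> segments = worstCaseSegments(1_000_000, 10, 100);

        int multiFile = countFiles(segments, 0, filesPerMultiFileSegment);                // no segment is compound
        int compound = countFiles(segments, Integer.MAX_VALUE, filesPerMultiFileSegment); // every segment is compound
        int mixed = countFiles(segments, 1_000_000, filesPerMultiFileSegment);            // only the big one is multi-file

        System.out.println("segments: " + segments.size());
        for (int files : new int[] {multiFile, compound, mixed}) {
            System.out.printf("%d files, %.1f open indexes per %d descriptors%n",
                    files, (double) maxFileDescriptors / files, maxFileDescriptors);
        }
    }
}
```

The counts come out as 629, 37 and 53 files for 37 segments, matching the message. Dividing 1024 by those counts gives roughly 1.6, 27.7 and 19.3, which the message rounds to 2, 27 and 20 open indexes.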


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

