lucene-dev mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents
Date Fri, 23 Mar 2007 00:13:31 GMT

On Mar 22, 2007, at 3:18 PM, Michael McCandless wrote:

> Actually is #2 a hard requirement?

A lot of Lucene users depend on document numbers corresponding to
age, I think.  ISTR Hatcher at least recommending techniques that
require it.

> Do the loose ports of Lucene
> (KinoSearch, Ferret, etc.) also follow this restriction?

KS: Nope.  So you can't use those tricks.

> I think instead of calling segments "level N" we should just measure
> their net sizes and merge on that basis?

Here's the Fibonacci-series-based algorithm used in KinoSearch, taken
from MultiReader:

sub segreaders_to_merge {
     my ( $self, $all ) = @_;
     return unless @{ $self->{sub_readers} };
     return @{ $self->{sub_readers} } if $all;

     # sort by ascending size in docs
     my @sorted_sub_readers
         = sort { $a->num_docs <=> $b->num_docs }
         @{ $self->{sub_readers} };

     # find sparsely populated segments
     my $total_docs = 0;
     my $threshold  = -1;
     for my $i ( 0 .. $#sorted_sub_readers ) {
         $total_docs += $sorted_sub_readers[$i]->num_docs;
         if ( $total_docs < fibonacci( $i + 5 ) ) {
             $threshold = $i;
         }
     }

     # if any of the segments are sparse, return their readers
     if ( $threshold > -1 ) {
         return @sorted_sub_readers[ 0 .. $threshold ];
     }
     else {
         return;
     }
}
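
The snippet above calls fibonacci() without showing it. What follows is a minimal standalone sketch of the same selection loop, not the KinoSearch code itself. It assumes a plain iterative helper with fibonacci(1) == fibonacci(2) == 1 (the indexing convention of the real helper is an assumption), and the hypothetical sizes_to_merge() takes bare doc counts instead of reader objects:

```perl
use strict;
use warnings;

# Hypothetical helper: iterative Fibonacci, fibonacci(1) == fibonacci(2) == 1.
# The real KinoSearch helper is not shown in the message; this is an assumption.
sub fibonacci {
    my $n = shift;
    my ( $x, $y ) = ( 0, 1 );
    ( $x, $y ) = ( $y, $x + $y ) for 1 .. $n;
    return $x;
}

# Same selection logic as segreaders_to_merge, applied to plain doc counts.
sub sizes_to_merge {
    my @sorted = sort { $a <=> $b } @_;

    my $total_docs = 0;
    my $threshold  = -1;
    for my $i ( 0 .. $#sorted ) {
        # running total of docs in the $i + 1 smallest segments
        $total_docs += $sorted[$i];

        # no early exit: $threshold ends up as the LAST index whose
        # running total is still under the Fibonacci bound
        $threshold = $i if $total_docs < fibonacci( $i + 5 );
    }

    return $threshold > -1 ? @sorted[ 0 .. $threshold ] : ();
}

my @merge = sizes_to_merge( 1000, 1, 2, 4 );
print "merge: @merge\n";    # prints "merge: 1 2 4"
```

With segment sizes 1, 2, 4 and 1000, the running totals are 1, 3, 7 and 1007 against bounds of 5, 8, 13 and 21, so the three small segments are selected and the large one is left alone. With sizes 100 and 200, neither running total is under its bound and nothing is merged.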

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/




