lucene-dev mailing list archives

From "Michael McCandless" <>
Subject Re: improve how IndexWriter uses RAM to buffer added documents
Date Thu, 05 Apr 2007 19:06:09 GMT

"Marvin Humphrey" <> wrote:

> > (I think for KS you "add" a previous segment not that
> > differently from how you "add" a document)?
> Yeah.  KS has to decompress and serialize posting content, which sux.
> The one saving grace is that with the Fibonacci merge schedule and  
> the seg-at-a-time indexing strategy, segments don't get merged nearly  
> as often as they do in Lucene.

Yeah we need to work on this one.  One thing that irks me about the
current Lucene merge policy (besides that it gets confused when you
flush-by-RAM-usage) is that it's a "pay it forward" design so you're
always over-paying when you build a given index size.  With KS's
Fibonacci merge policy, you don't.  LUCENE-854 has some more details.
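
To make the over-paying point concrete, here is a toy model of a
level-based merge policy: every flush adds one small segment, and
whenever mergeFactor equal-sized segments pile up they are merged into
one larger segment, cascading upward.  This is a simplified assumption
for illustration, not Lucene's actual merge code; the function name and
parameters are hypothetical, and it does not model the KS Fibonacci
schedule at all.

```python
def simulate_merges(num_flushes, docs_per_flush=1, merge_factor=10):
    """Toy level-based merge model (hypothetical, not Lucene's code).

    Returns (segment_sizes, docs_copied), where docs_copied counts
    every document rewritten by a merge.
    """
    segments = []   # segment sizes in docs, oldest first
    docs_copied = 0
    for _ in range(num_flushes):
        segments.append(docs_per_flush)
        # Cascade: merge while the newest merge_factor segments are the same size.
        while (len(segments) >= merge_factor
               and len(set(segments[-merge_factor:])) == 1):
            merged = sum(segments[-merge_factor:])
            del segments[-merge_factor:]
            segments.append(merged)
            docs_copied += merged
    return segments, docs_copied


if __name__ == "__main__":
    for flushes in (99, 100):
        sizes, copied = simulate_merges(flushes)
        print(flushes, "flushes:", len(sizes), "segments,", copied, "docs copied")
```

In this model the merge work jumps the moment a level fills up, even if
indexing stops right there, which is one way to read the "pay it
forward" complaint above.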

Segment merging really is costly.  In building a large (86 GB, 10 MM
docs) index, 65.6% of the time was spent merging!  Details are in

> > On C) I think it is important so the many ports of Lucene can "compare
> > notes" and "cross fertilize".
> Well, if you port Lucene's benchmarking stuff to Perl/C, I'll apply  
> the patch. ;)

I hear you!

> Cross-fertilization is a powerful tool for stimulating algorithmic  
> innovation.  Exhibit A: our unfolding collaborative successes.

Couldn't agree more.

> That's why it was built into the Lucy proposal:
>      [Lucy's C engine] will provide core, performance-critical
>      functionality, but leave as much up to the higher-level
>      language as possible.
> Users from diverse communities approach problems from different  
> angles and come up with different solutions.  The best ones will  
> propagate across Lucy bindings.
> The only problem is that since Dave Balmain has been much less  
> available than we expected, it's been largely up to me to get Lucy to  
> critical mass where other people can start writing bindings.

This is a great model.  Are there Python bindings to Lucy yet/coming?

> > But does KS give its users a choice in Tokenizer?
> You supply a regular expression which matches one token.
>    # Presto! A WhiteSpaceTokenizer:
>    my $tokenizer = KinoSearch::Analysis::Tokenizer->new(
>        token_re => qr/\S+/
>    );
> > Or, can users pre-tokenize their fields themselves?
> TokenBatch provides an API for bulk addition of tokens; you can  
> subclass Analyzer to exploit that.

Ahh, I get it.  Nice!
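
The "regular expression which matches one token" idea carries over to
other languages.  A minimal sketch in Python, assuming nothing about
the KinoSearch internals; the tokenize helper and its token_re
parameter are hypothetical names that only mirror the Perl snippet
above:

```python
import re


def tokenize(text, token_re=r"\S+"):
    """Return every non-overlapping match of token_re in text, in order.

    The pattern describes what one token looks like, rather than what
    separates tokens.
    """
    return [match.group(0) for match in re.finditer(token_re, text)]


if __name__ == "__main__":
    # Default pattern behaves like a whitespace tokenizer.
    print(tokenize("improve how IndexWriter uses RAM"))
    # A different pattern gives a different tokenizer with no other code changes.
    print(tokenize("Hello, world!", token_re=r"\w+"))
```

Swapping the pattern is the only change needed to get a different
tokenizer, which is the appeal of describing a token rather than a
delimiter.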

