incubator-lucy-dev mailing list archives

From Michael McCandless <>
Subject Re: real time updates
Date Wed, 18 Mar 2009 22:41:25 GMT
Marvin Humphrey <> wrote:
> Mike McCandless:
>> It seems like redundant code to merge a run and to merge segments.
> There's not that much duplication any more.  The stable branch of KS
> can only merge postings and lexicon data by dumping everything into
> the external sorter, adding an inefficient extra step.  However, the
> dev branch can read directly from existing content.
> The runs which are merged by the external sorter are represented by
> individual "PostingPool" objects, which can be fed either from run
> data written during the present indexing session, or by live segment
> postings and lexicon files.  Flushing a run to disk uses the same
> format, so we only need one read class.

OK sounds like the code is well shared.
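To make the shared machinery concrete: merging sorted runs -- whether flushed to disk during the session or read from a live segment's postings -- is at heart a k-way merge, which is why one read class can serve both. A minimal sketch (illustrative only, with hypothetical posting data; not the actual KS code):

```python
import heapq

def merge_runs(runs):
    """K-way merge of pre-sorted runs of (term, doc_id) postings.

    Each run may come from a flushed in-memory buffer or from an
    existing segment's postings and lexicon files -- the merge logic
    is identical either way.
    """
    return list(heapq.merge(*runs))

# Hypothetical postings: two flushed runs plus one "live segment" run.
run_a = [("apple", 1), ("banana", 3)]
run_b = [("apple", 2), ("cherry", 4)]
run_c = [("banana", 5)]

merged = merge_runs([run_a, run_b, run_c])
# merged is sorted by (term, doc_id) across all three sources
```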

> I think that algorithmically, this means KS has moved towards Lucene -- though
> the class infrastructure is still completely different.

But at the end of the session you still insist on merging down to one
segment before closing, right?  So if my added docs all fit in RAM,
closing is fast, but if I go a bit over one RAM buffer's worth, then
closing is suddenly slower since it must merge those two runs before
closing.

I think having a close that's unexpectedly slow is irksome.  I'd
prefer to pay my cost as I go...

Though, Lucene's close can also be unexpectedly slow, if a big
background merge is still running and you have to wait for it.

>> I'm not sure I like tying single-search runtime concurrency directly
>> to the index structure (this was broached on a Jira issue somewhere).
>> I think I'd prefer to have each thread skipTo its starting docID and
>> process its chunk, instead of wiring one-thread-per-segment, but I'm
>> really not sure yet.
> That would work fine if Lucene indexes were block-based.  Does the
> format of the current "multi-level skip data" allow you to jump into
> the middle of a segment?  The Lucene file format documentation is
> gnarly in general, and the skip data spec was the gnarliest section
> last I looked, so I don't want to go check myself.  :) You wouldn't
> have been able to skip forward efficiently with the 1.4.3 format I
> managed to get working with KinoSearch 0.05, but maybe it would work
> now.

Gnarly is a good word to describe it.  That conjures up these vivid
images for me :)

The multi-level skip list should let you quickly jump to any spot
(though I haven't specifically tested this).

> The nice thing about your approach is that it wouldn't require any
> special intervention at the indexing stage.

Right, and you're free to change your mind at search time about how
much concurrency you want to spend.
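That search-time flexibility can be sketched as follows (a toy model, not Lucene's API): split the docID space into equal chunks, chosen at query time, and have each worker advance to its chunk's starting docID before processing. Here a simple scan stands in for the multi-level skip data:

```python
def chunk_bounds(max_doc, num_workers):
    """Split the docID space [0, max_doc) into roughly equal chunks,
    one per worker -- decided at search time, not at indexing time."""
    step = -(-max_doc // num_workers)  # ceiling division
    return [(lo, min(lo + step, max_doc)) for lo in range(0, max_doc, step)]

def score_chunk(postings, lo, hi):
    """Each worker 'skips to' its starting docID and processes only
    its own chunk of the postings list."""
    return [d for d in postings if lo <= d < hi]

bounds = chunk_bounds(1000, 4)
# four workers, each covering a quarter of the docID space
```

Changing `num_workers` requires no change to the index itself, which is the appeal over wiring one thread per segment.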

> However, it would still be possible to tune the merge algorithm so
> that it tried to keep the segments mostly equal in size without
> requiring the user to call indexWriter.optimize(int numSegments).

True; they're not mutually exclusive.
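A merge policy along those lines might, as a rough sketch, keep merging the smallest segments together until the count drops to a target, tending toward equal sizes without an explicit optimize(numSegments) call (illustrative only; real policies also weigh deletes, adjacency, and I/O cost):

```python
def balance_merge(segment_sizes, max_segments):
    """Repeatedly merge the two smallest segments until at most
    max_segments remain, nudging the index toward equal-sized
    segments as a side effect of normal merging."""
    sizes = sorted(segment_sizes)
    while len(sizes) > max_segments:
        a, b = sizes[0], sizes[1]          # two smallest segments
        sizes = sorted(sizes[2:] + [a + b])  # merge them, re-sort
    return sizes

# e.g. balance_merge([1, 1, 2, 8], 2) folds the small segments together
```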

