incubator-lucy-dev mailing list archives

From Marvin Humphrey
Subject Re: real time updates
Date Wed, 18 Mar 2009 19:59:54 GMT
Mike McCandless: 

> It seems like redundant code to merge a run and to merge segments.  

There's not that much duplication any more.  The stable branch of KS can only
merge postings and lexicon data by dumping everything into the external sorter,
adding an inefficient extra step.  However, the dev branch can read directly
from existing content.

The runs which are merged by the external sorter are represented by individual
"PostingPool" objects, which can be fed either from run data written during the
present indexing session, or by live segment postings and lexicon files.
A run flushed to disk uses the same format as the segment files, so we only
need one reader class.
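
To make the idea concrete, here is a minimal Python sketch of merging several
sorted posting streams through one reader type.  It is not KS code: the
PostingPool class below is a hypothetical stand-in that only borrows the name,
the (term, doc_id) tuple layout is an assumption, and the standard library's
heapq.merge stands in for the external sorter's merge step.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple

# Assumed posting layout for illustration only: (term, doc_id).
Posting = Tuple[str, int]


class PostingPool:
    """Hypothetical stand-in: one already-sorted stream of postings.

    The source may be a run written during this session or the postings
    of an existing segment; the merge code does not care which.
    """

    def __init__(self, source: Iterable[Posting]) -> None:
        self._source = source

    def __iter__(self) -> Iterator[Posting]:
        return iter(self._source)


def merge_pools(pools: List[PostingPool]) -> Iterator[Posting]:
    """K-way merge of sorted pools into one sorted stream."""
    return heapq.merge(*pools)


# One pool fed from a fresh run, one fed from "segment" data.
session_run = PostingPool([("apple", 7), ("cat", 9)])
old_segment = PostingPool([("apple", 2), ("bird", 1)])
merged = list(merge_pools([session_run, old_segment]))
```

Because both inputs expose the same iterator interface, merging a run and
merging a segment go through the same code path.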

I think that algorithmically, this means KS has moved towards Lucene -- though
the class infrastructure is still completely different.

> I'm not sure I like tying single-search runtime concurrency directly
> to the index structure (this was broached on a Jira issue somewhere).
> I think I'd prefer to have each thread skipTo its starting docID and
> process its chunk, instead of wiring one-thread-per-segment, but I'm
> really not sure yet.

That would work fine if Lucene indexes were block-based.  Does the format of
the current "multi-level skip data" allow you to jump into the middle of a
segment?  The Lucene file format documentation is gnarly in general, and the
skip data spec was the gnarliest section last I looked, so I don't want to go
check myself.  :) You wouldn't have been able to skip forward efficiently with
the 1.4.3 format I managed to get working with KinoSearch 0.05, but maybe it
would work now.
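
For illustration, here is a rough Python sketch of what "skipTo a starting
docID and process a chunk" needs from the skip data.  It uses a single-level
skip table over an in-memory list, which is a simplification and not Lucene's
actual multi-level format; the class name, the interval, and the chunk()
helper are all hypothetical.

```python
from bisect import bisect_right
from typing import List


class SkippingPostings:
    """Sorted doc ids plus a one-level skip table (illustrative only)."""

    def __init__(self, doc_ids: List[int], interval: int = 4) -> None:
        self.doc_ids = doc_ids
        self.interval = interval
        # Skip entry i records the doc id stored at position i * interval.
        self.skip_docs = doc_ids[::interval]

    def skip_to(self, target: int) -> int:
        """Return the position of the first doc id >= target."""
        # Jump to the last skip entry whose doc id is <= target ...
        block = bisect_right(self.skip_docs, target) - 1
        pos = max(block, 0) * self.interval
        # ... then scan forward at most `interval` entries.
        while pos < len(self.doc_ids) and self.doc_ids[pos] < target:
            pos += 1
        return pos


def chunk(postings: SkippingPostings, start_doc: int, end_doc: int) -> List[int]:
    """Collect doc ids in [start_doc, end_doc), as one worker thread might."""
    pos = postings.skip_to(start_doc)
    hits = []
    while pos < len(postings.doc_ids) and postings.doc_ids[pos] < end_doc:
        hits.append(postings.doc_ids[pos])
        pos += 1
    return hits


postings = SkippingPostings([1, 3, 4, 8, 10, 15, 16, 20, 25, 31])
parts = [chunk(postings, 0, 10), chunk(postings, 10, 20), chunk(postings, 20, 40)]
```

Each chunk could be handed to a separate thread.  The approach only pays off
if skip_to can land near the target without reading everything before it,
which is exactly the question about the on-disk skip format.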

The nice thing about your approach is that it wouldn't require any special
intervention at the indexing stage.  However, it would still be possible to
tune the merge algorithm so that it tried to keep the segments mostly equal in
size without requiring the user to call indexWriter.optimize(int numSegments).
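
As a very rough sketch of what such tuning might look like, the Python below
repeatedly merges the two smallest segments until a target count is reached.
This is only one possible heuristic, not how KS or Lucene choose merges; the
function name and the use of plain integer sizes are assumptions made for
illustration.

```python
import heapq
from typing import List


def balance_segments(sizes: List[int], max_segments: int) -> List[int]:
    """Merge the two smallest segments until at most max_segments remain.

    Returns the resulting segment sizes in ascending order.
    """
    if max_segments < 1:
        raise ValueError("max_segments must be at least 1")
    heap = list(sizes)
    heapq.heapify(heap)
    while len(heap) > max_segments:
        smallest = heapq.heappop(heap)
        next_smallest = heapq.heappop(heap)
        # Merging the two smallest leaves the large segments untouched.
        heapq.heappush(heap, smallest + next_smallest)
    return sorted(heap)


result = balance_segments([100, 90, 10, 5, 5], 3)
```

Merging from the small end keeps rewrite cost low, but it cannot split an
oversized segment, so the result is only as even as the largest inputs allow.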

Marvin Humphrey
