lucene-dev mailing list archives

From "Jason Rutherglen" <>
Subject Re: Realtime Search for Social Networks Collaboration
Date Thu, 18 Sep 2008 13:44:43 GMT

The other issue that will come up, which I have addressed, is the field
caches.  The underlying smaller IndexReaders will need to be exposed
because of the field caching.  Currently in Ocean realtime search the
individual readers are searched using a MultiSearcher in order to search
in parallel and reuse the field caches.  How will field caching work with
the IndexWriter approach?  It seems like it would need a dynamically
growing field cache array, which is a bit tricky.  By doing in-memory
merging in Ocean, the field caches last longer and do not require growing
arrays.  How do you plan to handle rapidly deleting the docs of the disk
segments?  Can the SegmentReader clone patch be used for
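The per-segment field-cache reuse described above can be modeled with a small sketch. This is an illustrative model of the caching behavior, not Lucene's actual FieldCache or any Ocean code; the class and method names are made up for the example. The point is that cache entries are keyed by reader identity, so searching per-segment readers (as a MultiSearcher does) lets unchanged segments keep their cached arrays across a reopen:

```java
import java.util.Map;
import java.util.WeakHashMap;

public class SegmentFieldCacheModel {
  // One cached int[] per (sub-)reader, keyed by reader identity,
  // similar in spirit to how FieldCache keys its internal map.
  private final Map<Object, int[]> cache = new WeakHashMap<Object, int[]>();
  private int loads = 0; // counts cache misses, i.e. full array rebuilds

  public int[] getInts(Object reader, int docCount) {
    int[] vals = cache.get(reader);
    if (vals == null) {
      vals = new int[docCount]; // stands in for un-inverting the field
      cache.put(reader, vals);
      loads++;
    }
    return vals;
  }

  public int loads() { return loads; }
}
```

With per-segment readers, a reopen that only adds one new segment costs one cache load for that segment; a single monolithic reader would get a new identity on reopen and force a full rebuild.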


On Thu, Sep 11, 2008 at 8:29 AM, Michael McCandless
<> wrote:
> Right, there would need to be a snapshot taken of all terms when
> IndexWriter.getReader() is called.
> This snapshot would 1) hold a frozen int docFreq per term, and 2) sort the
> terms so TermEnum can just step through them.  (We might be able to delay
> this sorting until the first time something asks for it).  Also, it must
> merge this data from all threads, since each thread holds its hash per
> field.  I've got a rough start at coding this up...
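The snapshot Mike describes can be sketched as follows. This is an illustrative model under stated assumptions, not the actual patch: it merges the per-thread term hashes for one field, freezes each term's docFreq, and sorts the terms so a TermEnum could step through them in order (the sort could equally be deferred until first use, as noted above):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TermSnapshot {
  public final String[] terms;  // sorted term texts
  public final int[] docFreqs;  // frozen docFreq, parallel to terms

  public TermSnapshot(List<Map<String, Integer>> perThreadHashes) {
    // 1) Merge: each indexing thread holds its own term -> docFreq hash,
    // so the snapshot must sum the counts across threads.
    Map<String, Integer> merged = new HashMap<String, Integer>();
    for (Map<String, Integer> h : perThreadHashes) {
      for (Map.Entry<String, Integer> e : h.entrySet()) {
        Integer prev = merged.get(e.getKey());
        merged.put(e.getKey(),
                   prev == null ? e.getValue() : prev + e.getValue());
      }
    }
    // 2) Sort the terms so enumeration can just step through them.
    List<String> sorted = new ArrayList<String>(merged.keySet());
    Collections.sort(sorted);
    terms = sorted.toArray(new String[sorted.size()]);
    docFreqs = new int[terms.length];
    for (int i = 0; i < terms.length; i++) {
      docFreqs[i] = merged.get(terms[i]).intValue();
    }
  }
}
```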
> The costs are clearly growing, in order to keep the "point in time" feature
> of this RAMIndexReader, but I think are still well contained unless you have
> a really huge RAM buffer.
> Flushing is still tricky because we cannot recycle the byte block buffers
> until all running TermDocs/TermPositions iterations are "finished".
>  Alternatively, I may just allocate new byte blocks and allow the old ones
> to be GC'd on their own once running iterations are finished.
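The GC-based alternative Mike mentions can be sketched like this (an illustrative model, not the actual DocumentsWriter byte-block pool; all names here are invented for the example). Each open iterator holds a strong reference to the pool it reads from; a flush simply swaps in a fresh pool, and the old one becomes collectible once the last iterator over it is dropped:

```java
public class BufferPools {
  public static final class Pool {
    public final byte[][] blocks;
    public Pool(int numBlocks, int blockSize) {
      blocks = new byte[numBlocks][blockSize];
    }
  }

  public static final class Iter {
    private final Pool pool; // pins the pool for this iteration's lifetime
    public Iter(Pool pool) { this.pool = pool; }
    public byte read(int block, int offset) {
      return pool.blocks[block][offset];
    }
  }

  private Pool current = new Pool(4, 1024);

  public Pool current() { return current; }

  public Iter open() { return new Iter(current); }

  // Flush never waits for running iterations: it allocates a fresh
  // pool and leaves the old one to be GC'd when unreferenced.
  public void flush() { current = new Pool(4, 1024); }
}
```

The trade-off versus recycling is extra allocation and GC pressure per flush, in exchange for never having to track when running TermDocs/TermPositions iterations finish.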
> Mike
> Jason Rutherglen wrote:
>> Hi Mike,
>> There would be a new sorted list or something to replace the
>> hashtable?  Seems like an issue that is not solved.
>> Jason
>> On Tue, Sep 9, 2008 at 5:29 AM, Michael McCandless
>> <> wrote:
>>> This would just tap into the live hashtable that DocumentsWriter*
>>> maintain for the posting lists... except the docFreq will need to be
>>> copied away on reopen, I think.
>>> Mike
>>> Mike
>>> Jason Rutherglen wrote:
>>>> Term dictionary?  I'm curious how that would be solved?
>>>> On Mon, Sep 8, 2008 at 3:04 PM, Michael McCandless
>>>> <> wrote:
>>>>> Yonik Seeley wrote:
>>>>>>> I think it's quite feasible, but it'd still have a "reopen" cost in
>>>>>>> that any buffered delete by term or query would have to be
>>>>>>> "materialized" into docIDs on reopen.  Though, if this somehow turns
>>>>>>> out to be a problem in the future, we could do this materializing
>>>>>>> immediately, instead of buffering, if we already have a reader open.
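The "materializing" cost described here can be modeled with a short sketch. This is an illustrative stand-in, not Lucene's actual buffered-deletes code: deletes are recorded cheaply as terms, and only resolved into concrete docIDs (by consulting the postings) when a reader is opened:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BufferedDeletes {
  private final List<String> deletedTerms = new ArrayList<String>();

  // Cheap: records the term, no postings lookup yet.
  public void deleteByTerm(String term) {
    deletedTerms.add(term);
  }

  // Called on reopen: resolve each buffered term into the docIDs that
  // contain it.  The postings map stands in for a TermDocs lookup.
  public List<Integer> materialize(Map<String, int[]> postings) {
    List<Integer> docIDs = new ArrayList<Integer>();
    for (String term : deletedTerms) {
      int[] docs = postings.get(term);
      if (docs != null) {
        for (int d : docs) {
          docIDs.add(Integer.valueOf(d));
        }
      }
    }
    deletedTerms.clear();
    return docIDs;
  }
}
```

Doing this work eagerly at delete time (the alternative mentioned above) only pays off if a reader is already open to resolve against.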
>>>>>> Right... it seems like re-using readers internally is something we
>>>>>> could already be doing in IndexWriter.
>>>>> True.
>>>>>>> Flushing is somewhat tricky because any open RAM readers would
>>>>>>> have to cut over to the newly flushed segment once the flush
>>>>>>> completes, so that the RAM buffer can be recycled for the next
>>>>>>> segment.
>>>>>> Re-use of a RAM buffer doesn't seem like such a big deal.
>>>>>> But, how would you maintain a static view of an index...?
>>>>>> IndexReader r1 = indexWriter.getCurrentIndex()
>>>>>> indexWriter.addDocument(...)
>>>>>> IndexReader r2 = indexWriter.getCurrentIndex()
>>>>>> I assume r1 will have a view of the index before the document was
>>>>>> added, and r2 after?
>>>>> Right, getCurrentIndex would return a MultiReader that includes a
>>>>> SegmentReader for each segment in the index, plus a "RAMReader" that
>>>>> searches the RAM buffer.  That RAMReader is a tiny shell class that
>>>>> would basically just record the max docID it's allowed to go up to
>>>>> (the docID as of when it was opened), and stop enumerating docIDs
>>>>> (eg in the TermDocs) when it hits a docID beyond that limit.
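The docID-capping behavior of that shell class can be sketched as follows. This is a simplified stand-in for a TermDocs-style iterator, not Lucene's actual classes: it captures the max docID at open time and stops enumerating beyond it, which is what gives the reader its point-in-time view over a buffer other threads keep appending to:

```java
public class CappedDocIterator {
  private final int[] docs;  // postings for one term, ascending docIDs
  private final int maxDoc;  // docID limit captured at open time
  private int pos = -1;

  public CappedDocIterator(int[] docs, int maxDoc) {
    this.docs = docs;
    this.maxDoc = maxDoc;
  }

  public boolean next() {
    pos++;
    // Stop as soon as we reach a doc added after this reader was opened;
    // docIDs are assigned in increasing order, so nothing later qualifies.
    return pos < docs.length && docs[pos] < maxDoc;
  }

  public int doc() {
    return docs[pos];
  }
}
```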
>>>>> For reading stored fields and term vectors, which are now flushed
>>>>> immediately to disk, we need to somehow get an IndexInput from the
>>>>> IndexOutputs that IndexWriter holds open on these files.  Or, maybe,
>>>>> just open new IndexInputs?
>>>>>> Another thing that will help is if users could get their hands on
>>>>>> the sub-readers of a multi-segment reader.  Right now that is hidden
>>>>>> inside MultiSegmentReader, which makes updating anything
>>>>>> incrementally difficult.
>>>>> Besides what's handled by MultiSegmentReader.reopen already, what
>>>>> else do you need to incrementally update?
>>>>> Mike

To unsubscribe, e-mail:
For additional commands, e-mail:
