lucene-dev mailing list archives

From "Jason Rutherglen (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (LUCENE-2312) Search on IndexWriter's RAM Buffer
Date Wed, 24 Aug 2011 22:53:31 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-2312:
-------------------------------------

    Attachment: LUCENE-2312.patch

This is a revised version of the LUCENE-2312 patch.  The following are miscellaneous notes
on the patch and on what remains before it can be committed.

Feel free to review the approach taken, e.g., we work around the non-realtime structures by
copying arrays (the arrays can be pooled at some point).

* A copy of FreqProxPostingsArray.termFreqs is made per new reader.  That array can be pooled.
 This is no different from the deleted docs BitVector, which is created anew per segment for
any deletes that have occurred.

* FreqProxPostingsArray freqUptosRT, proxUptosRT, lastDocIDsRT, and lastDocFreqsRT are copied
into for each new reader (as opposed to an entirely new array being instantiated for each new
reader).  This is a slight optimization in object allocation.

* For deleting, a DWPT is wrapped in an abstract class that exposes the necessary methods
from segment info, so that deletes may be applied to the RT RAM reader.  The deleting is still
performed in BufferedDeletesStream.  BitVectors are cloned as well.  There is room for improvement,
e.g., pooling the BV byte[]s.

* Documents (FieldsWriter) and term vectors are flushed on each get reader call, so that reading
will be able to load the data.  We will need to test if this is performant.  We are not creating
new files so this way of doing things may well be efficient.

* We need to measure the cost of the native system array copy.  It could very well be fast
enough.

* Full posting functionality should be working, including payloads.

* Field caching may be implemented as a new field cache that is growable and allows locked
replacement of the underlying array.

* String to string ordinal comparison caches need to be figured out.  The RAM readers cannot
maintain a sorted terms index the way statically sized segments do.

* When a field cache value is first being created, it needs to obtain the indexing lock on
the DWPT.  Otherwise documents will continue to be indexed, new values created, while the
array will miss the new values.  The downside is that while the array is initially being created,
indexing will stop.  This can probably be solved at some point by only locking during the
creation of the field cache array, and then notifying the DWPT of the new array.  New values
would then accumulate into the array from the point of the max doc of the reader the values
creator is working from.

* The terms dictionary is a ConcurrentSkipListMap.  We can periodically convert it into an
int[] sorted by term, with an FST on top.
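
To make the array-copy and terms-dictionary notes above a little more concrete, here is a
minimal Java sketch.  It is not code from the patch: the class name, method names, and
parameters (RtSnapshotSketch, snapshot, sortedTermIds, numTerms) are hypothetical, and the
plain int[] stands in for arrays such as termFreqs.  It only shows the general idea of
copying a live array into a reusable, reader-owned array, and of reading term ids out of a
ConcurrentSkipListMap in term order.

```java
import java.util.Arrays;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical illustration only; none of these names come from the patch.
class RtSnapshotSketch {

    /**
     * Copies the first numTerms entries of the writer's live array into a
     * reader-owned array. A pooled array is reused when it is large enough,
     * so the result may be longer than numTerms; callers must track numTerms.
     */
    static int[] snapshot(int[] live, int[] reuse, int numTerms) {
        int[] target = (reuse != null && reuse.length >= numTerms)
                ? reuse
                : new int[numTerms];
        System.arraycopy(live, 0, target, 0, numTerms);
        return target;
    }

    /**
     * Returns the term ids in ascending term order. The skip list keeps its
     * keys sorted, so iterating its values yields ids ordered by term.
     */
    static int[] sortedTermIds(ConcurrentSkipListMap<String, Integer> terms) {
        return terms.values().stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        int[] live = {3, 1, 4, 0};
        int[] readerView = snapshot(live, null, 3);
        live[0] = 9; // the writer keeps indexing; the reader's copy is unaffected
        System.out.println(Arrays.toString(readerView));

        ConcurrentSkipListMap<String, Integer> terms = new ConcurrentSkipListMap<>();
        terms.put("lucene", 0);
        terms.put("apache", 1);
        System.out.println(Arrays.toString(sortedTermIds(terms)));
    }
}
```

Passing a previously returned array back in as the reuse argument is the pooling idea: no
new allocation happens as long as the pooled array is large enough.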

Have fun reviewing! :)

> Search on IndexWriter's RAM Buffer
> ----------------------------------
>
>                 Key: LUCENE-2312
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2312
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: core/search
>    Affects Versions: Realtime Branch
>            Reporter: Jason Rutherglen
>            Assignee: Michael Busch
>             Fix For: Realtime Branch
>
>         Attachments: LUCENE-2312-FC.patch, LUCENE-2312.patch, LUCENE-2312.patch
>
>
> In order to offer users near realtime search, without incurring
> an indexing performance penalty, we can implement search on
> IndexWriter's RAM buffer. This is the buffer that is filled in
> RAM as documents are indexed. Currently the RAM buffer is
> flushed to the underlying directory (usually disk) before being
> made searchable. 
> Today's Lucene-based NRT systems must incur the cost of merging
> segments, which can slow indexing. 
> Michael Busch has good suggestions regarding how to handle deletes using max doc ids.
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
> The area that isn't fully fleshed out is the terms dictionary,
> which needs to be sorted prior to queries executing. Currently
> IW implements a specialized hash table. Michael B has a
> suggestion here: 
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

