lucene-dev mailing list archives

From "Robert Engels" <reng...@ix.netcom.com>
Subject RE: GData Server - Lucene storage
Date Fri, 02 Jun 2006 16:07:27 GMT
What we've done is that if the number of incoming documents is less than
some threshold, we serialize the documents to a "pending" file instead of
using the IndexWriter. If it is greater than the threshold, we assume an
index rebuild is occurring and pass the updates directly to the
IndexWriter.

We always process the pending file before any queries. This allows rapid
index updates and transactional control: the updates can be batched, and if
the server crashes, the pending updates are still available in the
"pending" file. The pending file is deleted after successful processing.
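The scheme above can be sketched roughly as follows. This is a minimal, hypothetical illustration (not the poster's actual code): documents are modeled as key/value maps, the real Lucene IndexWriter is stood in by a callback, and the on-disk record format is an invented one.

```java
import java.io.*;
import java.util.*;
import java.util.function.Consumer;

/**
 * Sketch of the "pending file" idea: update batches below a threshold are
 * serialized to a pending file instead of going through the IndexWriter;
 * the file is replayed and deleted before queries are served.
 */
public class PendingFile {
    private final File file;
    private final int threshold;
    private final Consumer<Map<String, String>> indexWriter; // stand-in for Lucene's IndexWriter

    public PendingFile(File file, int threshold,
                       Consumer<Map<String, String>> indexWriter) {
        this.file = file;
        this.threshold = threshold;
        this.indexWriter = indexWriter;
    }

    /** Route a batch: small batches go to the pending file, large ones straight to the writer. */
    public synchronized void add(List<Map<String, String>> docs) throws IOException {
        if (docs.size() >= threshold) {            // assume an index rebuild: bypass the file
            docs.forEach(indexWriter);
            return;
        }
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(file, true)))) {
            for (Map<String, String> doc : docs) {
                out.writeInt(doc.size());          // record = field count, then key/value pairs
                for (Map.Entry<String, String> e : doc.entrySet()) {
                    out.writeUTF(e.getKey());
                    out.writeUTF(e.getValue());
                }
            }
        }
    }

    /** Replay pending updates into the writer, then delete the file; called before any query. */
    public synchronized void drain() throws IOException {
        if (!file.exists()) return;
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            while (true) {
                int n;
                try { n = in.readInt(); } catch (EOFException eof) { break; }
                Map<String, String> doc = new HashMap<>();
                for (int i = 0; i < n; i++) doc.put(in.readUTF(), in.readUTF());
                indexWriter.accept(doc);
            }
        }
        file.delete();                             // only after successful processing
    }
}
```

Crash recovery falls out for free: if the process dies, drain() finds the surviving pending file on the next startup and replays it.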

A possible improvement would be to build a RAMDirectory from the pending
file as well, and then run each query against the RAMDirectory in addition
to the main directory. Since our documents are uniquely "keyed", we can
fairly efficiently eliminate matches in the main directory for documents
that exist in (or have been deleted from) the RAMDirectory. If the number of
updates is low, this would improve search latency for readers.
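The key-based elimination step might look like the sketch below. Hits are reduced to bare key strings for illustration; in the real thing these would be Lucene documents from the two directories.

```java
import java.util.*;

/**
 * Sketch of merging results from a RAMDirectory overlay with the main
 * directory: overlay hits win, and main-directory hits whose unique key
 * was updated or deleted in the overlay are dropped as stale.
 */
public class OverlayMerge {
    public static List<String> merge(List<String> ramHits,
                                     List<String> mainHits,
                                     Set<String> overlayKeys,   // keys added/updated in the RAMDirectory
                                     Set<String> deletedKeys) { // keys deleted in the RAMDirectory
        List<String> out = new ArrayList<>(ramHits);            // overlay results take precedence
        for (String key : mainHits) {
            if (!overlayKeys.contains(key) && !deletedKeys.contains(key)) {
                out.add(key);                                   // main-index hit is still current
            }
        }
        return out;
    }
}
```

With hash sets for the overlay and deleted keys, the elimination is linear in the number of main-directory hits, which is why it stays cheap when the update volume is low.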

-----Original Message-----
From: Yonik Seeley [mailto:yseeley@gmail.com] 
Sent: Friday, June 02, 2006 10:55 AM
To: java-dev@lucene.apache.org
Subject: Re: GData Server - Lucene storage

On 6/2/06, Simon Willnauer <simon.willnauer@googlemail.com> wrote:
> This is also true. The problem is still the server response: if I
> queue some updates/inserts or index them into a RamDir, I still have
> the problem of concurrent indexing. The client should wait for the
> writing process to finish correctly; otherwise the response should be
> some Error 500. If the client will not wait (be held), there is a risk
> of a lost update.
> The same problem appears when indexing entries into the search index.
> There won't be a lot of concurrent inserts and updates, so I can't wait
> for other inserts to do batch indexing. I could index them into
> RamDirs and search multiple indexes, but what happens if the server
> crashes with a certain number of entries indexed into a RamDir?
>
> any solutions for that in the solr project?

But the problem is twofold:
 1) You can't freely mix adds and deletes in Lucene.
 2) Changes are not immediately visible... you need to close the current
writer and open a new IndexSearcher, which are relatively heavyweight
operations.

Solr solved (1) by adding all documents immediately as they come in (using
the same thread as the client request).  Deletes are replied to immediately,
but are deferred.  When a "commit" happens, the writer is closed, a new
reader is opened, and all the deletes are processed.
Then a new IndexSearcher is opened, making all the adds and deletes visible.
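The lifecycle described above can be condensed into a toy model. This is not Solr code: the "index" is a map keyed by unique id standing in for Lucene, and commit() plays the role of closing the writer, applying deletes, and opening a new IndexSearcher.

```java
import java.util.*;

/**
 * Toy model of Solr's scheme: adds hit the writer immediately, deletes are
 * acknowledged but buffered, and commit() applies the deletes and swaps in
 * a fresh searcher snapshot, making all changes visible at once.
 */
public class CommitModel {
    private final Map<String, String> writerState = new LinkedHashMap<>(); // live writer
    private final List<String> deferredDeletes = new ArrayList<>();
    private Map<String, String> searcherView = Map.of();                   // what queries see

    public void add(String id, String body) { writerState.put(id, body); } // applied immediately
    public void delete(String id) { deferredDeletes.add(id); }             // replied to, deferred

    public void commit() {
        deferredDeletes.forEach(writerState::remove);  // process buffered deletes at commit
        deferredDeletes.clear();
        searcherView = Map.copyOf(writerState);        // "new IndexSearcher": changes now visible
    }

    public boolean visible(String id) { return searcherView.containsKey(id); }
}
```

The immutable snapshot in searcherView is the point: readers never observe a half-applied commit, which mirrors how an open IndexSearcher keeps seeing its point-in-time view of the index.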

Solr doesn't do anything to solve (2).  Its main focus has been providing
high-throughput, low-latency queries, not the "freshness" of updates.

Decoupling the indexing from storage might help if new additions don't need
to be searchable (but do need to be retrievable by id)... you could make
storage synchronous, but batch the adds/deletes in some manner and open a
new IndexSearcher less frequently.
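One possible shape for that decoupling, again as a stdlib-only sketch with invented names: storage is a synchronous map, so a document is retrievable by id the moment add() returns, while searchability lags behind an infrequent flush that stands in for reopening the IndexSearcher.

```java
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Sketch of decoupling storage from indexing: get-by-id is served from
 * synchronous storage immediately, while search visibility waits for the
 * next batched flush.
 */
public class DecoupledStore {
    private final Map<String, String> store = new ConcurrentHashMap<>(); // synchronous storage
    private final Map<String, String> pendingAdds = new LinkedHashMap<>();
    private Map<String, String> searchable = Map.of();                   // search snapshot

    public synchronized void add(String id, String body) {
        store.put(id, body);        // retrievable by id at once
        pendingAdds.put(id, body);  // searchable only after the next flush
    }

    public String get(String id) { return store.get(id); }

    /** Run infrequently: fold batched adds in and open a new searcher snapshot. */
    public synchronized void flush() {
        Map<String, String> next = new HashMap<>(searchable);
        next.putAll(pendingAdds);
        pendingAdds.clear();
        searchable = Map.copyOf(next);
    }

    public boolean isSearchable(String id) { return searchable.containsKey(id); }
}
```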

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org



