lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Naomi Dushay" <Na...@cs.cornell.edu>
Subject sanity check - large, long running index updates and concurrent read-only service
Date Tue, 10 May 2005 20:45:30 GMT
Context:  our index is currently around 6 gig and takes about an hour just to
optimize.  Updating it, even in batches, can involve active updating for 15
or more minutes.

 

Index updates are done with two different batch processes as there are
currently two different workflows to update the index.  No obvious indexing
partitioning is suggested by our workflows.** The index is used in a read
only fashion by our REST search service, which runs under tomcat.

 

Issues:  we don't want our REST service to have slow or strange results while
the index is being updated.

 

 

Proposed solution:

 

The world starts with REST service pointing to index A .   Our REST service
is read only, so the index file(s) themselves can be read only.

 

To update the index, the following sequence occurs:

 

1.	lock index for update (via a lock/flag file)
2.	copy index A into another directory, and make it writable.  This new
index will become A'
3.	update index A', including optimization 
4.	set index A' to read only
5.	gracefully change REST service to point use index A';  verify that
REST service is working properly with A'.
6.	optional, but likely:  remove old, out-of-date index A.
7.	unlock index for update (remove lock/flag file)

 

The index updates could keep re-using the same two index directories, or they
could create new directories.

 

Does anyone see any problems or have any suggestions for how to improve this?


 

- Naomi Dushay

National Science Digital Library - Core Integration

Cornell University

 

** workflow 1:  we get bibliographic metadata that needs to go in the index.
This metadata may be new, or it may be an update to existing metadata.  The
metadata records refer to resource URLs.   Workflow 2:  we fetch the content
for the resource URLs, using Nutch as our crawler and content database.
Fetched content may be for a new URL or updated content for a known URL.  The
Lucene Documents are a combination of bibliographic metadata and fetched
content. 


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message