lucene-general mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Doug Cutting <cutt...@apache.org>
Subject [PROPOSAL] index server project
Date Wed, 18 Oct 2006 21:17:30 GMT
It seems that Nutch and Solr would benefit from a shared index serving 
infrastructure.  Other Lucene-based projects might also benefit from 
this.  So perhaps we should start a new project to build such a thing. 
This could start either in java/contrib, or as a separate sub-project, 
depending on interest.

Here are some quick ideas about how this might work.

An RPC mechanism would be used to communicate between nodes (probably 
Hadoop's).  The system would be configured with a single master node 
that keeps track of where indexes are located, and a number of slave 
nodes that would maintain, search and replicate indexes.  Clients would 
talk to the master to find out which indexes to search or update, then 
they'll talk directly to slaves to perform searches and updates.

Following is an outline of how this might look.

We assume that, within an index, a file with a given name is written 
only once.  Index versions are sets of files, and a new version of an 
index is likely to share most files with the prior version.  Versions 
are numbered.  An index server should keep old versions of each index 
for a while, not immediately removing old files.

public class IndexVersion {
   String Id;   // unique name of the index
   int version; // the version of the index
}

public class IndexLocation {
   IndexVersion indexVersion;
   InetSocketAddress location;
}

public interface ClientToMasterProtocol {
   IndexLocation[] getSearchableIndexes();
   IndexLocation getUpdateableIndex(String id);
}

public interface ClientToSlaveProtocol {
   // normal update
   void addDocument(String index, Document doc);
   int[] removeDocuments(String index, Term term);
   void commitVersion(String index);

   // batch update
   void addIndex(String index, IndexLocation indexToAdd);

   // search
   SearchResults search(IndexVersion i, Query query, Sort sort, int n);
}

public interface SlaveToMasterProtocol {
   // sends currently searchable indexes
   // recieves updated indexes that we should replicate/update
   public IndexLocation[] heartbeat(IndexVersion[] searchableIndexes);
}

public interface SlaveToSlaveProtocol {
   String[] getFileSet(IndexVersion indexVersion);
   byte[] getFileContent(IndexVersion indexVersion, String file);
   // based on experience in Hadoop, we probably wouldn't really use
   // RPC to send file content, but rather HTTP.
}

The master thus maintains the set of indexes that are available for 
search, keeps track of which slave should handle changes to an index and 
initiates index synchronization between slaves.  The master can be 
configured to replicate indexes a specified number of times.

The client library can cache the current set of searchable indexes and 
periodically refresh it.  Searches are broadcast to one index with each 
id and return merged results.  The client will load-balance both 
searches and updates.

Deletions could be broadcast to all slaves.  That would probably be fast 
enough.  Alternately, indexes could be partitioned by a hash of each 
document's unique id, permitting deletions to be routed to the 
appropriate slave.

Does this make sense?  Does it sound like it would be useful to Solr? 
To Nutch?  To others?  Who would be interested and able to work on it?

Doug

Mime
View raw message