lucene-general mailing list archives

From Stefan Groschupf ...@101tec.com>
Subject Re: [Fwd: [PROPOSAL] index server project]
Date Thu, 19 Oct 2006 13:55:03 GMT
Hi Doug,

We have discussed the need for such a tool several times internally and
developed some workarounds for Nutch, so I would definitely be
interested in contributing to such a project.
Having a separate project that depends on Hadoop would be the best
fit for our use cases.

Best,
Stefan



On 18.10.2006 at 23:35, Doug Cutting wrote:

> FYI, I just pitched a new project you might be interested in on  
> general@lucene.apache.org.  Dunno if you subscribe to that list, so I'm
> spamming you.  If it sounds interesting, please reply there.  My  
> management at Y! is interested in this, so I'm 'in'.
>
> Doug
>
> -------- Original Message --------
> Subject: [PROPOSAL] index server project
> Date: Wed, 18 Oct 2006 14:17:30 -0700
> From: Doug Cutting <cutting@apache.org>
> Reply-To: general@lucene.apache.org
> To: general@lucene.apache.org
>
> It seems that Nutch and Solr would benefit from a shared index serving
> infrastructure.  Other Lucene-based projects might also benefit from
> this.  So perhaps we should start a new project to build such a thing.
> This could start either in java/contrib, or as a separate sub-project,
> depending on interest.
>
> Here are some quick ideas about how this might work.
>
> An RPC mechanism would be used to communicate between nodes (probably
> Hadoop's).  The system would be configured with a single master node
> that keeps track of where indexes are located, and a number of slave
> nodes that would maintain, search and replicate indexes.  Clients would
> talk to the master to find out which indexes to search or update, then
> they'll talk directly to slaves to perform searches and updates.
>
> Following is an outline of how this might look.
>
> We assume that, within an index, a file with a given name is written
> only once.  Index versions are sets of files, and a new version of an
> index is likely to share most files with the prior version.  Versions
> are numbered.  An index server should keep old versions of each index
> for a while, not immediately removing old files.
>
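A minimal sketch of what the write-once assumption above buys during
replication, in Java to match the outline. Because a file name is never
rewritten, comparing names is enough to decide what a replica still has
to copy. The class and method names here (VersionDiff, filesToFetch,
filesToExpire) are hypothetical and not part of the proposal.

```java
import java.util.Set;
import java.util.TreeSet;

final class VersionDiff {
    // Files the replica must fetch: named in the new version, not yet held locally.
    // Relies on the write-once assumption: same name implies same content.
    static Set<String> filesToFetch(Set<String> localFiles, Set<String> newVersionFiles) {
        Set<String> missing = new TreeSet<>(newVersionFiles);
        missing.removeAll(localFiles);
        return missing;
    }

    // Files that no retained version refers to any more, so they may be
    // removed once the server decides old versions have been kept long enough.
    static Set<String> filesToExpire(Set<String> localFiles,
                                     Iterable<Set<String>> retainedVersions) {
        Set<String> unused = new TreeSet<>(localFiles);
        for (Set<String> version : retainedVersions) {
            unused.removeAll(version);
        }
        return unused;
    }
}
```

With most files shared between consecutive versions, filesToFetch would
typically return only a small part of the new version's file set.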
> public class IndexVersion {
>   String id;   // unique name of the index
>   int version; // the version of the index
> }
>
> public class IndexLocation {
>   IndexVersion indexVersion;
>   InetSocketAddress location;
> }
>
> public interface ClientToMasterProtocol {
>   IndexLocation[] getSearchableIndexes();
>   IndexLocation getUpdateableIndex(String id);
> }
>
> public interface ClientToSlaveProtocol {
>   // normal update
>   void addDocument(String index, Document doc);
>   int[] removeDocuments(String index, Term term);
>   void commitVersion(String index);
>
>   // batch update
>   void addIndex(String index, IndexLocation indexToAdd);
>
>   // search
>   SearchResults search(IndexVersion i, Query query, Sort sort, int n);
> }
>
> public interface SlaveToMasterProtocol {
>   // sends currently searchable indexes
>   // receives updated indexes that we should replicate/update
>   public IndexLocation[] heartbeat(IndexVersion[] searchableIndexes);
> }
>
> public interface SlaveToSlaveProtocol {
>   String[] getFileSet(IndexVersion indexVersion);
>   byte[] getFileContent(IndexVersion indexVersion, String file);
>   // based on experience in Hadoop, we probably wouldn't really use
>   // RPC to send file content, but rather HTTP.
> }
>
> The master thus maintains the set of indexes that are available for
> search, keeps track of which slave should handle changes to an index and
> initiates index synchronization between slaves.  The master can be
> configured to replicate indexes a specified number of times.
>
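One way the master might decide what to replicate, sketched under
assumptions: it collects from heartbeats which slaves report each index
id as searchable, then compares that count against the configured
replication factor. ReplicationPlanner, missingReplicas and the use of
plain strings for slave names are all hypothetical.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class ReplicationPlanner {
    // holders: index id -> slaves that reported the index in their last heartbeat.
    // Returns, for each under-replicated index, how many more copies are needed.
    static Map<String, Integer> missingReplicas(Map<String, Set<String>> holders,
                                                int targetReplicas) {
        Map<String, Integer> missing = new TreeMap<>();
        for (Map.Entry<String, Set<String>> entry : holders.entrySet()) {
            int deficit = targetReplicas - entry.getValue().size();
            if (deficit > 0) {
                missing.put(entry.getKey(), deficit);
            }
        }
        return missing;
    }
}
```

The master could then answer the next heartbeat of a slave that lacks
the index with the location to copy it from.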
> The client library can cache the current set of searchable indexes and
> periodically refresh it.  Searches are broadcast to one index with each
> id and return merged results.  The client will load-balance both
> searches and updates.
>
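The client-side behaviour described above could be sketched as follows:
pick one replica per index id, search each, and merge the hits. This is
only an illustration. Location and Hit are hypothetical stand-ins for
IndexLocation and the contents of SearchResults, random choice stands in
for whatever load balancing the client would really use, and merging by
raw score assumes scores from different indexes are comparable.

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

final class ClientSearchSketch {
    // Hypothetical stand-in for IndexLocation: one replica of one index.
    static final class Location {
        final String indexId;
        final InetSocketAddress address;
        Location(String indexId, InetSocketAddress address) {
            this.indexId = indexId;
            this.address = address;
        }
    }

    // Hypothetical hit type; the proposal only names SearchResults.
    static final class Hit {
        final String docId;
        final float score;
        Hit(String docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    // Group the searchable replicas by index id and choose one per id at random.
    static Map<String, Location> pickOnePerId(List<Location> searchable, Random random) {
        Map<String, List<Location>> byId = new TreeMap<>();
        for (Location loc : searchable) {
            byId.computeIfAbsent(loc.indexId, k -> new ArrayList<>()).add(loc);
        }
        Map<String, Location> chosen = new TreeMap<>();
        for (Map.Entry<String, List<Location>> entry : byId.entrySet()) {
            List<Location> replicas = entry.getValue();
            chosen.put(entry.getKey(), replicas.get(random.nextInt(replicas.size())));
        }
        return chosen;
    }

    // Merge per-index hit lists and keep the n highest-scoring hits overall.
    static List<Hit> mergeTopN(List<List<Hit>> perIndexHits, int n) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> hits : perIndexHits) {
            all.addAll(hits);
        }
        all.sort((a, b) -> Float.compare(b.score, a.score)); // descending by score
        return new ArrayList<>(all.subList(0, Math.min(n, all.size())));
    }
}
```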
> Deletions could be broadcast to all slaves.  That would probably be fast
> enough.  Alternately, indexes could be partitioned by a hash of each
> document's unique id, permitting deletions to be routed to the
> appropriate slave.
>
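The hash-partitioning alternative above could look roughly like this.
The sketch assumes a fixed number of partitions and uses
String.hashCode(), whose value is fixed by the Java specification and so
is the same on every node; a real system might well choose a different
hash. HashRouter and partitionFor are hypothetical names.

```java
final class HashRouter {
    // Map a document's unique id to a partition in [0, numPartitions).
    // Math.floorMod keeps the result non-negative even when hashCode() is negative.
    static int partitionFor(String uniqueId, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        return Math.floorMod(uniqueId.hashCode(), numPartitions);
    }
}
```

A deletion for a given id would then go only to the slave serving that
partition. Note that changing the partition count remaps most ids, so
this scheme suits a partition count that rarely changes.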
> Does this make sense?  Does it sound like it would be useful to Solr?
> To Nutch?  To others?  Who would be interested and able to work on it?
>
> Doug
>

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
101tec Inc.
search tech for web 2.1
Menlo Park, California
http://www.101tec.com
