lucene-general mailing list archives

From "Alexandru Popescu" <the.mindstorm.mailingl...@gmail.com>
Subject Re: [Fwd: [PROPOSAL] index server project]
Date Thu, 19 Oct 2006 14:19:00 GMT
I am not sure whether this is related, but I think I noticed a
similar project in a Sun contest (it was the grand prize winner). I
cannot find it right now, but hopefully somebody else can.

./alex
--
.w( the_mindstorm )p.


On 10/19/06, Stefan Groschupf <sg@101tec.com> wrote:
> Hi Doug,
>
> we discussed the need for such a tool several times internally and
> developed some workarounds for nutch, so I would definitely be
> interested in contributing to such a project.
> Having a separate project that depends on hadoop would be the best
> case for our use cases.
>
> Best,
> Stefan
>
>
>
> On 18.10.2006 at 23:35, Doug Cutting wrote:
>
> > FYI, I just pitched a new project you might be interested in on
> > general@lucene.com.  Dunno if you subscribe to that list, so I'm
> > spamming you.  If it sounds interesting, please reply there.  My
> > management at Y! is interested in this, so I'm 'in'.
> >
> > Doug
> >
> > -------- Original Message --------
> > Subject: [PROPOSAL] index server project
> > Date: Wed, 18 Oct 2006 14:17:30 -0700
> > From: Doug Cutting <cutting@apache.org>
> > Reply-To: general@lucene.apache.org
> > To: general@lucene.apache.org
> >
> > It seems that Nutch and Solr would benefit from a shared index serving
> > infrastructure.  Other Lucene-based projects might also benefit from
> > this.  So perhaps we should start a new project to build such a thing.
> > This could start either in java/contrib, or as a separate sub-project,
> > depending on interest.
> >
> > Here are some quick ideas about how this might work.
> >
> > An RPC mechanism would be used to communicate between nodes (probably
> > Hadoop's).  The system would be configured with a single master node
> > that keeps track of where indexes are located, and a number of slave
> > nodes that would maintain, search and replicate indexes.  Clients
> > would talk to the master to find out which indexes to search or
> > update, then they'll talk directly to slaves to perform searches and
> > updates.
> >
> > Following is an outline of how this might look.
> >
> > We assume that, within an index, a file with a given name is written
> > only once.  Index versions are sets of files, and a new version of an
> > index is likely to share most files with the prior version.  Versions
> > are numbered.  An index server should keep old versions of each index
> > for a while, not immediately removing old files.
> >
> > public class IndexVersion {
> >   String id;   // unique name of the index
> >   int version; // the version of the index
> > }
> >
> > public class IndexLocation {
> >   IndexVersion indexVersion;
> >   InetSocketAddress location;
> > }
> >
> > public interface ClientToMasterProtocol {
> >   IndexLocation[] getSearchableIndexes();
> >   IndexLocation getUpdateableIndex(String id);
> > }
> >
> > public interface ClientToSlaveProtocol {
> >   // normal update
> >   void addDocument(String index, Document doc);
> >   int[] removeDocuments(String index, Term term);
> >   void commitVersion(String index);
> >
> >   // batch update
> >   void addIndex(String index, IndexLocation indexToAdd);
> >
> >   // search
> >   SearchResults search(IndexVersion i, Query query, Sort sort, int n);
> > }
> >
> > public interface SlaveToMasterProtocol {
> >   // sends currently searchable indexes
> >   // receives updated indexes that we should replicate/update
> >   IndexLocation[] heartbeat(IndexVersion[] searchableIndexes);
> > }
> >
> > public interface SlaveToSlaveProtocol {
> >   String[] getFileSet(IndexVersion indexVersion);
> >   byte[] getFileContent(IndexVersion indexVersion, String file);
> >   // based on experience in Hadoop, we probably wouldn't really use
> >   // RPC to send file content, but rather HTTP.
> > }
> >
> > The master thus maintains the set of indexes that are available for
> > search, keeps track of which slave should handle changes to an
> > index and initiates index synchronization between slaves.  The
> > master can be configured to replicate indexes a specified number of
> > times.
> >
> > The client library can cache the current set of searchable indexes and
> > periodically refresh it.  Searches are broadcast to one index with
> > each id and return merged results.  The client will load-balance
> > both searches and updates.
> >
> > Deletions could be broadcast to all slaves.  That would probably be
> > fast enough.  Alternately, indexes could be partitioned by a hash of
> > each document's unique id, permitting deletions to be routed to the
> > appropriate slave.
> >
> > Does this make sense?  Does it sound like it would be useful to Solr?
> > To Nutch?  To others?  Who would be interested and able to work on it?
> >
> > Doug
> >
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 101tec Inc.
> search tech for web 2.1
> Menlo Park, California
> http://www.101tec.com
>
>
>
>
>
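For what it's worth, the hash-partitioning idea in Doug's proposal quoted above (route each deletion to one slave instead of broadcasting it) could be sketched roughly as follows. This is only an illustration under assumptions: the class name `DeletionRouter`, the fixed partition count, and the choice of CRC32 as the hash are all hypothetical and do not come from the proposal.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical helper: maps a document's unique id to a partition number,
// so a deletion can be sent to the one slave that owns that partition.
final class DeletionRouter {
    private final int numPartitions;

    DeletionRouter(int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        this.numPartitions = numPartitions;
    }

    // CRC32 is an assumption; any hash that is stable across JVMs and
    // machines would do. getValue() is non-negative, so the remainder
    // is always in the range [0, numPartitions).
    int partitionFor(String documentId) {
        CRC32 crc = new CRC32();
        crc.update(documentId.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % numPartitions);
    }
}
```

The same function would have to be used when documents are added, so that a document and its later deletion land on the same partition. Note that changing the number of partitions remaps most ids, which this sketch does not address.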

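Similarly, the write-once assumption in the proposal (a file with a given name is written only once, and a new version shares most files with the prior one) suggests that a slave replicating a new version only needs to fetch the files it does not already hold. A minimal sketch, assuming the array returned by `getFileSet` from the proposed `SlaveToSlaveProtocol` is compared against a local listing; the class and method names here are hypothetical:

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper: decides which files of a remote index version
// still need to be copied to the local slave.
final class FileSetDiff {
    // remoteFileSet: file names of the version to replicate.
    // localFiles: file names already present locally for this index.
    // Because a given name is written only once, a matching name is
    // assumed to mean matching content, so only missing names are fetched.
    static Set<String> filesToFetch(String[] remoteFileSet, Set<String> localFiles) {
        Set<String> missing = new LinkedHashSet<>(Arrays.asList(remoteFileSet));
        missing.removeAll(localFiles);
        return missing;
    }
}
```

For example, with made-up file names, a slave holding `_1.cfs` and `segments_2` that is asked to replicate a version consisting of `_1.cfs`, `_2.cfs` and `segments_3` would fetch only `_2.cfs` and `segments_3`.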