hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Lucene-hadoop Wiki] Trivial Update of "DistributedLucene" by MarkButler
Date Wed, 19 Dec 2007 11:06:22 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.

The following page has been changed by MarkButler:
http://wiki.apache.org/lucene-hadoop/DistributedLucene

------------------------------------------------------------------------------
  {{{
  public interface ClientToDataNodeProtocol extends VersionedProtocol {
    void addDocument(String index, Document doc) throws IOException;
+   int removeDocuments(String index, Term term) throws IOException; // Change here, Doug suggested int[] but that is different to the current Lucene API
- 
-   // Change here, Doug suggested int[] but that is different
-   // to current Lucene API
- 
-   int removeDocuments(String index, Term term) throws IOException;
    IndexVersion commitVersion(String index) throws IOException;
  
    // batch update
  
-   void addIndex(String index) throws IOException;
+   void addIndex(String index) throws IOException; // Shouldn't this be called createIndex?
    void addIndex(String index, IndexLocation indexToAdd) throws IOException;
  
    // search
@@ -67, +63 @@

  {{{
  public interface DataNodeToDataNodeProtocol extends VersionedProtocol {
    String[] getFileSet(IndexVersion indexVersion) throws IOException;
+   byte[] getFileContent(IndexVersion indexVersion, String file) throws IOException; // based on experience in Hadoop we probably wouldn't really use RPC to fetch file content, but HTTP instead
-   byte[] getFileContent(IndexVersion indexVersion, String file)
-       throws IOException;
-   // based on experience in Hadoop we probably wouldn't really use
-   // RPC to find file content, instead HTTP
  }
  }}}
  
@@ -97, +90 @@

  
  Design the client API. 
  
+ One of the issues here is whether sharding should be handled solely at the client, using the API defined above. For example, myindex-1, myindex-2 and myindex-3 could be the shards of myindex. In that case the client takes full responsibility for sharding, and the Master and Workers know nothing about it. The alternative is to extend the API outlined above so that it is shard-aware: the workers store metadata about the relationships between shards and send it to the master, so the client can query that metadata rather than inferring it. 
+ 
+ To insert data, use a consistent hashing algorithm, as described at http://problemsworthyofattack.blogspot.com/2007/11/consistent-hashing.html
+ 
+ Then provide a query operation which calls all the shards.
+ 
+ Here is a proposal for the client API:
+ 
+ {{{
+ public interface ClientAPI {
+ 
+   void createIndex(String index, boolean sharded) throws IOException;
+ 
+   // Use IndexVersion because the client API does not need to know where the data is
+ 
+   IndexVersion[] getSearchableIndexes();
+   IndexVersion[] getUpdateableIndexes();
+   void addIndex(String index, IndexVersion indexToAdd) throws IOException;
+   void addDocument(String index, Document doc) throws IOException;
+   int removeDocuments(String index, Term term) throws IOException; // Change here, Doug suggested int[] but that is different to the current Lucene API
+   IndexVersion commit(String index) throws IOException;
+   SearchResults search(IndexVersion i, Query query, Sort sort, int n) throws IOException;
+ }
+ }}}
+ 
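The consistent-hashing suggestion above could be sketched as follows. This is a minimal illustration under stated assumptions, not part of the proposal: the `ConsistentHash` class, its `shardFor` method, and the FNV-1a-style hash are all hypothetical names. Each shard is placed on a hash ring at several virtual points; a document key routes to the first shard at or after its hash, so adding or removing a shard only remaps the keys that landed on it.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical sketch of consistent-hash routing of documents to shards.
public class ConsistentHash {
    private final SortedMap<Integer, String> ring = new TreeMap<>();
    private final int replicas; // virtual points per shard, for balance

    public ConsistentHash(int replicas) { this.replicas = replicas; }

    public void addShard(String shard) {
        for (int i = 0; i < replicas; i++)
            ring.put(hash(shard + "#" + i), shard);
    }

    public void removeShard(String shard) {
        for (int i = 0; i < replicas; i++)
            ring.remove(hash(shard + "#" + i));
    }

    // Route a key to the first shard at or after its hash, wrapping around.
    public String shardFor(String key) {
        if (ring.isEmpty()) throw new IllegalStateException("no shards");
        SortedMap<Integer, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
    }

    // 32-bit FNV-1a; any stable, well-mixed hash would do.
    private static int hash(String s) {
        int h = 0x811c9dc5;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x01000193;
        }
        return h;
    }
}
```

With this, the client (or a shard-aware master) would call `shardFor(documentKey)` before `addDocument`, so only one shard is touched per insert.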

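The "query operation which calls all the shards" then needs a merge step on the client. Here is a minimal sketch of that merge, assuming each shard returns its own top-n hits as (docId, score) pairs; the `Hit` class and `mergeTopN` helper are illustrative and not part of the proposed ClientAPI.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical merge step for a scatter-gather query across shards.
public class ScatterGather {

    static class Hit {
        final String docId;
        final float score;
        Hit(String docId, float score) { this.docId = docId; this.score = score; }
    }

    // Merge per-shard top-n lists into a single global top-n by score.
    static List<Hit> mergeTopN(List<List<Hit>> perShard, int n) {
        // Min-heap of size n: the root is the weakest hit kept so far.
        PriorityQueue<Hit> heap =
            new PriorityQueue<>(Comparator.comparingDouble(h -> h.score));
        for (List<Hit> shardHits : perShard) {
            for (Hit h : shardHits) {
                heap.offer(h);
                if (heap.size() > n) heap.poll(); // evict the lowest score
            }
        }
        List<Hit> top = new ArrayList<>(heap);
        top.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return top;
    }
}
```

Note that for correct global results each shard must return its full top-n, not just its share of n, since one shard may hold all of the best matches.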