lucene-java-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Lucene-java Wiki] Update of "OceanRealtimeSearch" by JasonRutherglen
Date Thu, 04 Sep 2008 04:12:05 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-java Wiki" for change notification.

The following page has been changed by JasonRutherglen:
http://wiki.apache.org/lucene-java/OceanRealtimeSearch

------------------------------------------------------------------------------
  
  = Tag Index =
  
- The tag index patch is located at [https://issues.apache.org/jira/browse/LUCENE-1292 LUCENE-1292].
 I had seen people mention using a ParallelReader to have an index that is static and an index
that is dynamic appear as one index.  The challenge with this type of system is to get the
doc numbers to stay aligned.  Google seems to have a realtime tag index system.  I figured
there must be some way using the Lucene architecture to achieve the same thing.  The method
I came up with is to divide the postings list into blocks.  Each block contains a set number
of documents, the blocks are not divided by actual byte size but by document number.  The
blocks are unified using a TagMultiTermDocs class.  When a block is changed it is written
to RAM.  Once the RAM usage hits a certain size, the disk and memory postings are merged to
disk.  There needs to be coordination between this process and the merging of the segments.
 Each Tag Index is associated with a segment.  In Ocean the merging of segments is performed
by the Ocean code and not IndexWriter so the coordination does
not involve hooking into IndexWriter.  Currently there needs to be a way to obtain the doc
id from an addDocument call from IndexWriter which needs a patch still.  
+ The tag index patch is located at [https://issues.apache.org/jira/browse/LUCENE-1292 LUCENE-1292].
 I had seen people mention using a [http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/org/apache/lucene/index/ParallelReader.html
ParallelReader] to have an index that is static and an index that is dynamic appear as one
index.  The challenge with this type of system is to get the doc numbers to stay aligned.
 Google seems to have a realtime tag index system.  I figured there must be some way using
the Lucene architecture to achieve the same thing.  The method I came up with is to divide
the postings list into blocks.  Each block contains a set number of documents, the blocks
are not divided by actual byte size but by document number.  The blocks are unified using
a TagMultiTermDocs class.  When a block is changed it is written to RAM.  Once the RAM usage
hits a certain size, the disk and memory postings are merged to disk.  There needs to be coordination
between this process and the merging of the segments.  Each Tag Index is associated with a segment.
 In Ocean the merging of segments is performed by the Ocean code and not IndexWriter so the
coordination does not involve hooking into IndexWriter.  Currently there needs to be a way
to obtain the doc id from an addDocument call from IndexWriter which needs a patch still.
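
A minimal Java sketch of the block scheme described above, for illustration only. It is not the code in LUCENE-1292 and does not use any Lucene API. The class name `BlockedPostings`, its methods, and the `maxRamBlocks` threshold (a stand-in for "RAM usage hits a certain size") are all assumptions, and the "disk" is simulated with an in-memory map. The `docs()` method plays roughly the role attributed to TagMultiTermDocs, presenting the blocks as one ordered postings list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical sketch, not part of LUCENE-1292: one tag's postings in doc-number blocks. */
class BlockedPostings {
    private final int docsPerBlock;
    private final int maxRamBlocks; // assumed stand-in for a RAM size limit
    // Block number -> sorted doc ids. "Disk" is simulated with an in-memory map.
    private final TreeMap<Integer, TreeSet<Integer>> diskBlocks = new TreeMap<>();
    private final TreeMap<Integer, TreeSet<Integer>> ramBlocks = new TreeMap<>();

    BlockedPostings(int docsPerBlock, int maxRamBlocks) {
        this.docsPerBlock = docsPerBlock;
        this.maxRamBlocks = maxRamBlocks;
    }

    /** Blocks are chosen by document number, not by byte size. */
    int blockOf(int docId) {
        return docId / docsPerBlock;
    }

    void addTag(int docId) {
        mutableBlock(docId).add(docId);
        mergeIfNeeded();
    }

    void removeTag(int docId) {
        mutableBlock(docId).remove(Integer.valueOf(docId));
        mergeIfNeeded();
    }

    /** A changed block is copied into RAM; the disk copy stays untouched until the merge. */
    private TreeSet<Integer> mutableBlock(int docId) {
        int block = blockOf(docId);
        TreeSet<Integer> docs = ramBlocks.get(block);
        if (docs == null) {
            docs = new TreeSet<Integer>();
            TreeSet<Integer> onDisk = diskBlocks.get(block);
            if (onDisk != null) {
                docs.addAll(onDisk);
            }
            ramBlocks.put(block, docs);
        }
        return docs;
    }

    /** Once too many blocks are held in RAM, they replace their disk counterparts. */
    private void mergeIfNeeded() {
        if (ramBlocks.size() > maxRamBlocks) {
            diskBlocks.putAll(ramBlocks);
            ramBlocks.clear();
        }
    }

    /** Unified view in doc order: a RAM block overrides the disk block with the same number. */
    List<Integer> docs() {
        TreeMap<Integer, TreeSet<Integer>> view = new TreeMap<>(diskBlocks);
        view.putAll(ramBlocks);
        List<Integer> out = new ArrayList<>();
        for (TreeSet<Integer> block : view.values()) {
            out.addAll(block);
        }
        return out;
    }

    int ramBlockCount() {
        return ramBlocks.size();
    }
}
```

Because a block covers a fixed range of doc numbers, a tag change only rewrites the block that holds the affected document. The sketch leaves out the coordination with segment merging, which would have to remap doc numbers when segments are combined.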
 
  
  = Distributed Search =
  
@@ -126, +126 @@

  
  = Name Service =
  
- Name services can become quite complex.  For example it may be possible in the future to
use [http://hadoop.apache.org/zookeeper/ Zookeeper] which is a lock based service.  However
even by Zookeeper's own admission these types of lock services are hard to implement and use
correctly.  I think for Ocean it should be good enough in the first release to have an open
source SQL database that stores the nodes and the cells the nodes belong to.  Because there
is no master there is no need for a locking service.  The columns in the node table would
be id, status (online/offline), cellid, datecreated, datemodified.  The cell table would simply
be id, status, datecreated, datemodified.  Redundant name services may be created by replicating
these 2 tables.  I am also pondering an errors table where clients may report outages of a
node.  If there are enough outages of a particular node the name service marks the node as
offline.  Clients will be able to listen for events on a name service related to cells,
mainly the node status column.  This way if a node that was online
goes offline, the client will know about it and not send requests to it any longer.  
+ Name services can become quite complex.  For example it may be possible in the future to
use [http://hadoop.apache.org/zookeeper/ Zookeeper] which is a lock based service.  However
even by Zookeeper's own admission these types of lock services are hard to implement and use
correctly.  I think for Ocean it should be good enough in the first release to have an open
source SQL database that stores the nodes and the cells the nodes belong to.  Because there
is no master there is no need for a locking service.  The columns in the node table would
be id, location, status (online/offline), cellid, datecreated, datemodified.  The cell table
would simply be id, status, datecreated, datemodified.  Redundant name services may be created
by replicating these 2 tables.  I am also pondering an errors table where clients may report
outages of a node.  If there are enough outages of a particular node the name service marks
the node as offline.  Clients will be able to listen for events on a name service related
to cells, mainly the node status column.  This way if a node that
was online goes offline, the client will know about it and not send requests to it any longer.
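
A rough Java sketch of the name service behaviour described above, with an in-memory map standing in for the SQL node table. The fields of `Node` follow the proposed columns (id, location, status, cellid, datecreated, datemodified). Everything else is an assumption made for illustration: the class and method names, the `outageThreshold` value that decides what "enough outages" means, and the callback-style listener.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical sketch of the name service; an in-memory map replaces the SQL tables. */
class NameService {
    enum Status { ONLINE, OFFLINE }

    /** Mirrors the proposed node table columns. */
    static final class Node {
        final String id;
        final String location;
        final String cellId;
        final Instant dateCreated;
        Status status = Status.ONLINE;
        Instant dateModified;

        Node(String id, String location, String cellId) {
            this.id = id;
            this.location = location;
            this.cellId = cellId;
            this.dateCreated = Instant.now();
            this.dateModified = this.dateCreated;
        }
    }

    private final Map<String, Node> nodes = new HashMap<>();
    private final Map<String, Integer> outageReports = new HashMap<>(); // the "errors table"
    private final List<Consumer<Node>> listeners = new ArrayList<>();
    private final int outageThreshold; // assumed: reports needed before marking a node offline

    NameService(int outageThreshold) {
        this.outageThreshold = outageThreshold;
    }

    void register(Node node) {
        nodes.put(node.id, node);
    }

    /** Clients subscribe to node status changes. */
    void addStatusListener(Consumer<Node> listener) {
        listeners.add(listener);
    }

    /** A client reports that it could not reach the node. */
    void reportOutage(String nodeId) {
        Node node = nodes.get(nodeId);
        if (node == null || node.status == Status.OFFLINE) {
            return;
        }
        int reports = outageReports.merge(nodeId, 1, Integer::sum);
        if (reports >= outageThreshold) {
            node.status = Status.OFFLINE;
            node.dateModified = Instant.now();
            for (Consumer<Node> listener : listeners) {
                listener.accept(node);
            }
        }
    }

    /** Nodes of a cell that clients may still send requests to. */
    List<Node> onlineNodes(String cellId) {
        List<Node> out = new ArrayList<>();
        for (Node n : nodes.values()) {
            if (n.cellId.equals(cellId) && n.status == Status.ONLINE) {
                out.add(n);
            }
        }
        return out;
    }
}
```

No lock is taken anywhere in the sketch, which matches the point that a masterless design does not need a locking service. A real deployment would persist the two tables and replicate them for redundancy.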
 
  
  = Location Based Services =
  
