lucene-java-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Lucene-java Wiki] Update of "OceanRealtimeSearch" by JasonRutherglen
Date Sun, 31 Aug 2008 16:10:25 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-java Wiki" for change notification.

The following page has been changed by JasonRutherglen:
http://wiki.apache.org/lucene-java/OceanRealtimeSearch

------------------------------------------------------------------------------
  
  In a master-slave architecture the update is submitted to the master first and then to the slaves, so the two steps are not performed in parallel.  Configuration changes such as turning [http://mysqlha.blogspot.com/2007/05/semi-sync-replication-for-mysql-5037.html semi-sync] on or off would require restarting all processes in the system.  
  
- Perhaps the best way to implement replication is to simply let the client handle the updates to the nodes.  The client generates a globally unique object id and calls the remote update method concurrently on the nodes.  If there are many nodes this makes the system faster than waiting for the master to do its work and then the slaves.  It also allows the client to control how long it is willing to wait for a transaction to complete and how many nodes it requires for a transaction to be considered successful.  If there is an error, the update call may be revoked across the nodes.  If this fails, a process on each node rectifies transactions that are inconsistent with those of other nodes.  This is more like how biology works, I believe.  The master slave architecture seems somewhat barbaric in its connotations.
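The client-driven update described above can be sketched roughly as follows.  This is a minimal illustration, not Ocean code: the `Node` interface, the thread pool, and the quorum counting are all assumptions of the sketch, standing in for whatever remote call Ocean actually exposes.

```java
import java.util.List;
import java.util.UUID;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class QuorumUpdate {

    /** Hypothetical remote node interface; the real call would go over the wire. */
    public interface Node {
        boolean update(String objectId, String document) throws Exception;
    }

    /**
     * The client generates a globally unique object id and calls update on all
     * nodes concurrently, succeeding once `quorum` nodes have acknowledged.
     */
    public static boolean clientUpdate(List<Node> nodes, String document,
                                       int quorum, long timeoutMs) throws InterruptedException {
        String objectId = UUID.randomUUID().toString(); // client-generated globally unique id
        ExecutorService pool = Executors.newFixedThreadPool(nodes.size());
        CompletionService<Boolean> done = new ExecutorCompletionService<>(pool);
        for (Node n : nodes) {
            done.submit(() -> n.update(objectId, document));
        }
        int acks = 0;
        long deadline = System.currentTimeMillis() + timeoutMs;
        try {
            for (int i = 0; i < nodes.size() && acks < quorum; i++) {
                long remaining = Math.max(deadline - System.currentTimeMillis(), 0);
                Future<Boolean> f = done.poll(remaining, TimeUnit.MILLISECONDS);
                if (f == null) {
                    break; // the client decides how long it is willing to wait
                }
                try {
                    if (Boolean.TRUE.equals(f.get())) {
                        acks++;
                    }
                } catch (ExecutionException e) {
                    // a failed node does not count toward the quorum
                }
            }
        } finally {
            pool.shutdownNow(); // in the real system a failed quorum would trigger revocation
        }
        return acks >= quorum;
    }
}
```

The key point is that the client, not a master, chooses both the timeout and the number of acknowledgements it requires.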
 
+ The ideal architecture would allow any node to act as the proxy for the other nodes, which effectively makes every node a master.  The transaction would be submitted to all nodes, and the client would determine how many nodes the transaction must succeed on.  To recover from a transaction that fails on a node, every node periodically polls the other nodes and rectifies inconsistent transactions.  This polling does not need to run very often; however, a node that is just coming back online needs to reject queries until it is up to date.  The node may obtain the latest transactions from any other node.  
+ 
+ When a new node comes online, it simply downloads the entire set of Lucene index files from another node.  The transaction log will not always contain all of the transactions present in its indexes because there is no need.  It is faster for a new node to download the indexes first, then obtain the transactions it does not yet have from another node's transaction log.
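A rough sketch of that catch-up step, assuming (purely for illustration) that transactions carry monotonically increasing ids; the actual Ocean transaction log format may differ:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;

public class NodeBootstrap {

    /**
     * After downloading the index files, the new node asks a peer for every
     * transaction newer than the last one reflected in the downloaded index.
     * Monotonically increasing transaction ids are an assumption of this
     * sketch, not a statement about Ocean's actual log format.
     */
    public static List<String> catchUp(long lastTxIdInIndex,
                                       NavigableMap<Long, String> peerLog) {
        // tailMap(key, false) = entries strictly greater than the last applied id
        return new ArrayList<>(peerLog.tailMap(lastTxIdInIndex, false).values());
    }
}
```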
  
  Because the Ocean system stores the entire document on an update, and there is no support for updating specific fields as in SQL, it is much easier to rectify transactions between nodes: deletes and updates of objects are less likely to clobber each other during the rectification process.  
  
@@ -71, +73 @@

  
  = Storing the Data =
  
- SOLR uses a schema.  I chose not to use a schema because the realtime index should be able
to change at any time.  Instead the raw Lucene field classes such as Store, TermVector, and
Indexed are exposed in the OceanObject class.  An analyzer is defined on a per field per OceanObject
basis.  Using serialization, this process is not slow and is not bulky over the network as
serialization performs referencing of redundant objects.  GData allows the user to store multiple
types for a single field.  For example, a field named battingaverage may contain fields of
type long and text.  I am really not sure how Google handles this underneath.  I decided to
use Solr's NumberUtils class that encodes numbers into sortable strings.  This allows range
queries and other enumerations of the field to return the values in their true order rather
than string order.  One method I came up with to handle potentially different types in a field
is to prepend a letter signifying the type of the value for untokenized fields.  For a string the value would be "s0.323" and for a long "l845445".
 This way when sorting or enumerating over the values they stay disparate and can be modified
to be their true value upon return of the call.  Perhaps there is a better method.
+ SOLR uses a schema.  I chose not to use a schema because the realtime index should be able
to change at any time.  Instead the raw Lucene field classes such as Store, TermVector, and
Indexed are exposed in the OceanObject class.  An analyzer is defined on a per field per OceanObject
basis.  The process of serializing analyzers is not slow and is not bulky over the network
as serialization performs referencing of redundant objects in the data stream.  GData allows
the user to store multiple types for a single field.  For example, a field named battingaverage
may contain fields of type double and text.  I am really not sure how Google handles this
underneath.  I decided to use Solr's NumberUtils class that encodes numbers into sortable
strings.  This allows range queries and other enumerations of the field to return the values
in their true order rather than string order.  One method I came up with to handle potentially
different types in a field is to prepend a letter signifying the type of the value for untokenized fields.  For a string the value would be "s0.323"
and for a long "l845445".  This way when sorting or enumerating over the values they stay
disparate and can be modified to be their true value upon return of the call.  Perhaps there
is a better method.
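The letter-prefix scheme might look like the following sketch.  It only shows the prefixing idea from the "s0.323" / "l845445" examples above; the real system would additionally run numbers through a sortable encoding such as Solr's NumberUtils, which this sketch omits.

```java
public class TypedValues {

    /** Prefix untokenized values with a letter naming their type. */
    public static String encode(Object value) {
        if (value instanceof Long) {
            // the real system would also make this sortable (e.g. via NumberUtils)
            return "l" + value;
        }
        return "s" + value; // everything else is stored as a plain string
    }

    /** Recover the true value when returning results to the caller. */
    public static Object decode(String stored) {
        char type = stored.charAt(0);
        String body = stored.substring(1);
        return type == 'l' ? (Object) Long.parseLong(body) : body;
    }
}
```

Because each type keeps its own prefix, values of different types stay disparate when sorting or enumerating, exactly as the paragraph above describes.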
  
  Since writing the above I came up with an alternative mechanism to handle any number in a Field.Index.UN_TOKENIZED field.  If the field string can be parsed into a double, then the number is encoded using SOLR's NumberUtils into an encoded double string.  There may be edge cases I am unaware of that make this system not work, but for right now it looks like it will work.  In order to properly process a query, terms whose fields are Field.Index.UN_TOKENIZED will need to be checked for a number by attempting to parse them.  If the value can be parsed into a double then it is encoded into a double string and replaced in the term.  A similar process will be used for date strings, which will conform to the ISO 8601 standard.  The user may also explicitly define the type of a number by naming the field such as "battingaverage_double", where the type is defined by an underscore and then the type name.  This is similar to Solr's dynamic fields construction.  
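The parse-then-encode step could be sketched as below.  The encoding shown is the standard sortable-bits trick for doubles, used here only as a stand-in for whatever format Solr's NumberUtils actually produces; `normalizeTerm` is a hypothetical helper name.

```java
public class NumberTerms {

    /**
     * A lexicographically sortable double encoding (illustrative only):
     * flip the sign bit for non-negative doubles and all bits for negative
     * ones, then print fixed-width hex, so string order matches numeric order.
     */
    public static String encodeDouble(double d) {
        long bits = Double.doubleToLongBits(d);
        bits = (bits < 0) ? ~bits : (bits ^ 0x8000000000000000L);
        return String.format("%016x", bits);
    }

    /**
     * Query-time step: if an UN_TOKENIZED term parses as a double, replace it
     * with the encoded form so it matches what was written to the index.
     */
    public static String normalizeTerm(String value) {
        try {
            return encodeDouble(Double.parseDouble(value));
        } catch (NumberFormatException e) {
            return value; // not a number; leave the term alone
        }
    }
}
```

With such an encoding, range queries over the field enumerate values in true numeric order rather than string order.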
  
@@ -84, +86 @@

Distributed search with Ocean will use the http://issues.apache.org/jira/browse/LUCENE-1336 patch.  It provides RMI functionality over the Hadoop IPC protocol.  Using Hadoop IPC as a transport has advantages over using Sun's RMI because it is simpler and uses NIO.  In large systems using NIO reduces thread usage and allows the overall system to scale better.  LUCENE-1336 allows classes to be dynamically loaded by the server from the client on a per client basis to avoid problems with classloaders and class versions.  For me, implementing functionality with a remote method invocation system is much faster than using Solr and implementing XML interfaces and clients or using NamedLists.  I prefer writing distributed code using Java objects because they are what I am most comfortable with.  Also, I worked on Jini at Sun, and one might say it is in the blood.  The idea to create a better technique for classloading comes from my experiences and the failures of trying to implement Jini systems.  Search is a fairly straightforward, non-changing problem, and so the dynamic classloading is only required by the server from the client.  By having a reduced scope problem, the solution was much easier to arrive at compared to working with Jini, which attempted to solve all potential problems even if they most likely do not exist.  
  
  In the future it is possible to write a servlet wrapper around the Ocean Java client and
expose the Ocean functionality as XML possibly conforming to [http://www.opensearch.org OpenSearch]
and/or GData.  
+ 
+ An object is localized to a cell, meaning that after it is created it usually remains in the same cell over its lifespan.  This is to ensure that searches remain consistent.  The object contains the cellid of where it originated from.  This allows subsequent updates to the object (in Lucene, a deleteDocument followed by an addDocument) to occur in the correct cell.
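A toy sketch of that cell-locality rule.  In the real system the cellid travels with the object itself; the map here is only an illustration, and the hash-based cell assignment is an assumption of the sketch, not Ocean's actual policy.

```java
import java.util.HashMap;
import java.util.Map;

public class CellRouter {

    private final int cellCount;
    // stands in for the cellid stored on each object
    private final Map<String, Integer> cellOfObject = new HashMap<>();

    public CellRouter(int cellCount) {
        this.cellCount = cellCount;
    }

    /** On create, pick a cell and remember it as the object's cellid. */
    public int create(String objectId) {
        int cell = Math.floorMod(objectId.hashCode(), cellCount);
        cellOfObject.put(objectId, cell);
        return cell;
    }

    /**
     * Updates (a delete followed by an add, in Lucene terms) must route back
     * to the originating cell so that searches remain consistent.
     */
    public int routeUpdate(String objectId) {
        Integer cell = cellOfObject.get(objectId);
        if (cell == null) {
            throw new IllegalStateException("unknown object: " + objectId);
        }
        return cell;
    }
}
```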
 
+ 
+ = Name Service =
+ 
+ Name services can become quite complex.  For example, it may be possible in the future to use ZooKeeper, which is a lock based service.  However, even by ZooKeeper's own admission these types of lock services are hard to implement and use correctly.  I think for Ocean it should be good enough in the first release to have an open source SQL database that stores the nodes and the cells the nodes belong to.  Because there is no master there is no need for a locking service.  The columns in the node table would be id, status (online/offline), cellid, datecreated, and datemodified.  The cell table would simply be id, status, datecreated, and datemodified.  Redundant name services may be created by replicating these two tables.  I am also pondering an errors table where clients may report outages of a node.  If there are enough outages of a particular node, the name service marks the node as offline.  Clients will be able to listen for events on a name service related to cells, mainly the node status column.  This way if a node that was online goes offline, the client will know about it and not send requests to it any longer.  
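The outage-reporting idea could be sketched in memory as follows.  The real design would keep this state in the SQL node and errors tables; the class, method names, and threshold policy here are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class NameService {

    public enum Status { ONLINE, OFFLINE }

    private final int outageThreshold;
    private final Map<String, Status> nodeStatus = new HashMap<>();   // stands in for the node table
    private final Map<String, Integer> errorCounts = new HashMap<>(); // stands in for the errors table

    public NameService(int outageThreshold) {
        this.outageThreshold = outageThreshold;
    }

    public void register(String nodeId) {
        nodeStatus.put(nodeId, Status.ONLINE);
    }

    /**
     * Clients report outages; once enough reports accumulate, the node's
     * status column flips to offline so listening clients stop sending it
     * requests.
     */
    public void reportError(String nodeId) {
        int count = errorCounts.merge(nodeId, 1, Integer::sum);
        if (count >= outageThreshold) {
            nodeStatus.put(nodeId, Status.OFFLINE);
        }
    }

    public Status status(String nodeId) {
        return nodeStatus.get(nodeId);
    }
}
```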
  
  = Location Based Services =
  
