lucene-java-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Lucene-java Wiki] Update of "OceanRealtimeSearch" by JasonRutherglen
Date Tue, 02 Sep 2008 23:04:29 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-java Wiki" for change notification.

The following page has been changed by JasonRutherglen:
http://wiki.apache.org/lucene-java/OceanRealtimeSearch

------------------------------------------------------------------------------
  
  Ocean uses a different process than usual for writing indexes to disk.  Instead of merging on disk, meaning reading from the indexes on disk while writing the new index at the same time, the merge occurs in RAM.  For a RamIndex this is trivial because the index is already in RAM and is simply written to disk.  When multiple DiskIndexes are merged, the new index is first created in RAM using RAMDirectory and then copied to disk.  The reason for creating the index in RAM first is to avoid rapid hard drive head movement.  DiskIndexes are usually at least partially in the system file cache, so the normal merging process is fast for reads but slow because of the incremental writes.  Hard drives are optimized for large sequential writes, which is exactly what Ocean performs by first creating the index in RAM.  The large segment, typically 64MB in size, is written all at once, which should take 5-10 seconds.  
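  The following is a minimal sketch (not Ocean's actual code) of the merge-in-RAM idea using the Lucene 2.x API of the time: the DiskIndexes are merged into a RAMDirectory and the finished segment is then copied to disk in one sequential pass.  The class and method names are illustrative.

{{{
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

public class MergeInRamSketch {
  public static void mergeToDisk(Directory[] diskIndexes, String newIndexPath) throws Exception {
    RAMDirectory ramDir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(ramDir, new StandardAnalyzer(), true);
    writer.addIndexes(diskIndexes);   // reads from disk, writes the merged segment into RAM
    writer.optimize();                // produce a single optimized segment
    writer.close();
    // one large sequential write of the finished segment to disk
    Directory.copy(ramDir, FSDirectory.getDirectory(newIndexPath), true);
  }
}
}}}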
  
- Each transaction internally is recognized as a snapshot.  A snapshot (org.apache.lucene.ocean.Snapshot)
consists of a series of IndexSnapshots (org.apache.lucene.ocean.Index.IndexSnapshot).  The
parent class of DiskIndex and RamIndex is DirectoryIndex.  DirectoryIndex uses IndexReader.clone
http://issues.apache.org/jira/browse/LUCENE-1314 in the creation of an IndexSnapshot.  IndexReader.clone
creates a copy of an IndexReader that can be modified without altering the original IndexReader
like IndexReader.reopen does.  DirectoryIndexSnapshots never have documents added to them
as they are single segment optimized indexes.  DirectoryIndexSnapshots are only deleted from.
 Each each transaction with deletes does not result in a IndexReader.flush call because this
process is expensive.  Instead, because the transaction is already stored on disk in the transaction
log, the deletes occur only to the SegmentReader.deletedDocs.  
+ Each transaction internally is recognized as a snapshot.  A snapshot (org.apache.lucene.ocean.Snapshot)
consists of a series of IndexSnapshots (org.apache.lucene.ocean.Index.IndexSnapshot).  The
parent class of DiskIndex and RamIndex is DirectoryIndex.  DirectoryIndex uses IndexReader.clone
http://issues.apache.org/jira/browse/LUCENE-1314 in the creation of an IndexSnapshot.  IndexReader.clone
creates a copy of an IndexReader that can be modified without altering the original IndexReader
like IndexReader.reopen does.  DirectoryIndexSnapshots never have documents added to them
as they are single segment optimized indexes.  DirectoryIndexSnapshots are only deleted from.
 Each transaction with deletes does not result in an IndexReader.flush call because this process
is expensive.  Instead, because the transaction is already stored on disk in the transaction
log, the deletes occur only to the SegmentReader.deletedDocs.  
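  A minimal sketch of the approach, assuming the IndexReader.clone method from the LUCENE-1314 patch: deletes for a transaction are applied to the cloned reader only, so the reader backing the previous snapshot is left untouched and nothing is flushed to disk.  The names here are illustrative, not Ocean's actual classes.

{{{
import org.apache.lucene.index.IndexReader;

public class SnapshotDeleteSketch {
  public static IndexReader snapshotWithDeletes(IndexReader current, int[] docIdsToDelete) throws Exception {
    IndexReader snapshot = (IndexReader) current.clone();  // independent copy, unlike reopen
    for (int docId : docIdsToDelete) {
      snapshot.deleteDocument(docId);  // only marks SegmentReader.deletedDocs in the clone
    }
    return snapshot;                   // earlier snapshots are unaffected by these deletes
  }
}
}}}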
  
  Facets and filters need to be cached per Index.  Each Index is really the same as a Lucene segment.  However, due to the way Lucene is designed, if one is merging outside of IndexWriter then each segment needs to be in its own physical directory.  This creates some extra files such as the segmentinfos file.  Ocean manages deleting the old index directories when they are no longer necessary.  
  
@@ -48, +48 @@

  Each transaction is recorded in the transaction log, which is a series of files with the file name format log00000001.bin.  The suffix number is incremented and a new log file is created when the current log file reaches a predefined size limit.  The class org.apache.lucene.ocean.log.LogFileManager is responsible for this process.  
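  A hypothetical sketch (not the actual LogFileManager code) of the rolling file-name scheme described above: log00000001.bin, log00000002.bin, and so on, with a new file started once the current one passes the size limit.

{{{
import java.io.File;

public class LogRollSketch {
  private final File logDir;
  private final long maxBytes;   // predefined size limit
  private int suffix = 1;        // current log file number

  public LogRollSketch(File logDir, long maxBytes) {
    this.logDir = logDir;
    this.maxBytes = maxBytes;
  }

  /** Returns the file the next transaction record should be appended to. */
  public File currentLogFile() {
    File current = new File(logDir, String.format("log%08d.bin", suffix));
    if (current.exists() && current.length() >= maxBytes) {
      suffix++;                  // roll to a new log file
      current = new File(logDir, String.format("log%08d.bin", suffix));
    }
    return current;
  }
}
}}}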
  
  The transaction record consists of three separate parts: the header, the document bytes, and the other bytes.  The other bytes can store anything other than the documents, usually the serialized deletes.  Each part has a CRC32 check which ensures the integrity of the data.  The transaction log can become corrupted if the process is stopped in the middle of a write.  There is a CRC32 check with each part because the parts are loaded separately at different times.  For example, during the recovery process on Ocean server startup, the documents are loaded first and in-memory indexes are created.  Then the deletes from the transactions are executed.  Then the indexes are optimized to remove the deleted documents.  This process is much faster than performing each transaction incrementally during recovery.  It is important to note that internally each delete, especially a delete by query, is saved as the actual document ids that were deleted when the transaction was committed.  If the system simply re-executed the delete by query, the transaction could produce inconsistent results.  
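  An illustrative sketch of writing a record with a CRC32 per part, so the header, documents, and deletes can each be verified independently when they are loaded at different times during recovery.  The actual binary layout Ocean uses is not documented here, so field order and lengths are assumptions.

{{{
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.zip.CRC32;

public class TransactionRecordSketch {
  static void writePart(DataOutputStream out, byte[] part) throws IOException {
    CRC32 crc = new CRC32();
    crc.update(part);
    out.writeInt(part.length);
    out.writeLong(crc.getValue());    // lets a truncated or corrupt part be detected
    out.write(part);
  }

  static void writeRecord(DataOutputStream out, byte[] header, byte[] documentBytes, byte[] otherBytes) throws IOException {
    writePart(out, header);
    writePart(out, documentBytes);    // loaded first during recovery to rebuild in-memory indexes
    writePart(out, otherBytes);       // usually the serialized deletes, applied afterwards
  }
}
}}}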
+ 
+ = Snapshot Log =
+ 
+ The snapshot log is a set of rolling log files that contain the snapshot information.  Each transaction generates a new snapshot entry in the current snapshot log file.  The snapshot element contains the index elements.  The log is stored as XML so it is human readable, which is useful for debugging.  
+ 
+ Example:
+ 
+ <snapshot id="29.02" numDocs="10" maxDoc="25" deletedDocs="15">
+ 
+ <index snapshotid="974" id="787" segmentGeneration="401" type="disk" maxDoc="466" numDocs="442"
deletedDoc="95" minDocumentId="117" maxDocumentId="483" minSnapshotId="693" maxSnapshotId="116"
deleteFlushId="876" lastAppliedId="780" />
+ </snapshot>
+ 
+ ||Name||Description||
+ ||snapshotid||The id of the snapshot||
+ ||id||The index id||
+ ||segmentGeneration||Segment generation of the index as reported by IndexReader.||
+ ||type||The type of the index||
+ ||maxDoc||The max doc value of the index||
+ ||numDocs||Number of documents in the index||
+ ||deletedDoc||Number of deleted documents in the index||
+ ||minDocumentId||The minimum document id in the index||
+ ||maxDocumentId||The maximum document id in the index||
+ ||minSnapshotId||The minimum snapshot id in the index||
+ ||maxSnapshotId||The maximum snapshot id in the index||
+ ||deleteFlushId||Snapshot id the last time the deleted docs were flushed to disk||
+ ||lastAppliedId||The last snapshot id that affected this index||
+ 
  
  = Replication =
  
@@ -63, +90 @@

  
  = Crowding =
  
- GBase mentions a feature that is perhaps somewhat interesting and this is [http://code.google.com/apis/base/attrs-queries.html#crowding
crowding].  It is similar to what in Solr is referred to as [https://issues.apache.org/jira/browse/SOLR-236
Field Collapsing] however the implementation for Ocean could be a little bit easier and more
efficient.  Solr's Field Collapse code performs a sort on the results first and then seems
to perform another query.  GBase allows only 2 fields to be crowded.  Also it would seem to
be easier to simply obtain more results than are needed and crowd a field similar to how the
NutchBean has a dedupField.  I have tried to implement this feature into Ocean and have been
unable to get it quite right. 
+ GBase mentions a feature that is perhaps somewhat interesting: [http://code.google.com/apis/base/attrs-queries.html#crowding crowding].  It is similar to what Solr refers to as [https://issues.apache.org/jira/browse/SOLR-236 Field Collapsing], however the implementation for Ocean could be a little easier and more efficient.  Solr's Field Collapse code performs a sort on the results first and then seems to perform another query.  GBase allows only 2 fields to be crowded, which makes the implementation seem a bit easier.  It would also seem easier to simply obtain more results than are needed and crowd on a field, similar to how the NutchBean uses a dedupField.  I have tried to implement this feature in Ocean and have been unable to get it quite right. 
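  A rough sketch of the "fetch more than needed, then crowd" idea: keep at most a fixed number of hits per value of the crowd field while walking the score-ordered results.  The Hit interface here is a placeholder, not Ocean's API.

{{{
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CrowdingSketch {
  interface Hit { String getFieldValue(String field); }

  public static List<Hit> crowd(List<Hit> hits, String crowdField, int maxPerValue, int wanted) {
    Map<String, Integer> counts = new HashMap<String, Integer>();
    List<Hit> result = new ArrayList<Hit>();
    for (Hit hit : hits) {                        // hits are already sorted by score
      String value = hit.getFieldValue(crowdField);
      Integer seen = counts.get(value);
      int count = seen == null ? 0 : seen.intValue();
      if (count < maxPerValue) {                  // drop hits that would over-crowd this value
        result.add(hit);
        counts.put(value, Integer.valueOf(count + 1));
      }
      if (result.size() >= wanted) {
        break;
      }
    }
    return result;
  }
}
}}}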
  
  = Facets =
  
@@ -76, +103 @@

  SOLR uses a schema.  I chose not to use a schema because the realtime index should be able to change at any time.  Instead, the raw Lucene field classes such as Store, TermVector, and Indexed are exposed in the OceanObject class.  An analyzer is defined on a per-field, per-OceanObject basis.  The process of serializing analyzers is not slow and is not bulky over the network, as serialization references redundant objects in the data stream.  GData allows the user to store multiple types for a single field.  For example, a field named battingaverage may contain fields of type double and text.  I am really not sure how Google handles this underneath.  I decided to use Solr's NumberUtils class, which encodes numbers into sortable strings.  This allows range queries and other enumerations of the field to return the values in their true order rather than string order.  One method I came up with to handle potentially different types in a field is to prepend a letter signifying the type of the value for untokenized fields.  For a string the value would be "s0.323" and for a long "l845445".  This way when sorting or enumerating over the values they stay disparate and can be converted back to their true value when returned from the call.  Perhaps there is a better method.
  
  Since writing the above I came up with an alternative mechanism to handle any number in a Field.Index.UN_TOKENIZED field.  If the field string can be parsed into a double, then the number is encoded using Solr's NumberUtils into an encoded double string.  There may be edge cases I am unaware of that make this system not work, but for right now it looks like it will work.  In order to properly process a query, terms in the query whose fields are Field.Index.UN_TOKENIZED will need to be checked for a number by attempting to parse them.  If the value can be parsed into a double then it is encoded into a double string and replaced in the term.  A similar process will be used for date strings, which will conform to the ISO 8601 standard.  The user may also explicitly define the type of a number by naming the field such as "battingaverage_double", where the type is defined by an underscore and then the type name.  This is similar to Solr's dynamic fields construction.  
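  A sketch of the parsing idea above, assuming Solr's org.apache.solr.util.NumberUtils for the sortable encoding: if the raw value of a Field.Index.UN_TOKENIZED field (or a query term on such a field) parses as a double, it is replaced by the encoded string so that sorting and range enumeration follow numeric rather than string order.

{{{
import org.apache.solr.util.NumberUtils;

public class UntokenizedValueEncoder {
  /** Applied both at index time and to query terms on UN_TOKENIZED fields. */
  public static String encode(String rawValue) {
    try {
      double d = Double.parseDouble(rawValue);
      return NumberUtils.double2sortableStr(d);   // sortable encoding preserves numeric order
    } catch (NumberFormatException e) {
      return rawValue;                            // not a number, leave the term unchanged
    }
  }
}
}}}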
+ 
+ The following are the default document fields.  Some are currently placed by TransactionSystem, some by OceanDatabase.  It is probably best for most applications to use OceanDatabase, so all of them are listed here (an illustrative example follows the table). 
+ ||Name||Description||
+ ||_id||Unique type long object id persistent across updates to an object assigned by Ocean.||
+ ||_documentid||Unique type long document id to uniquely identify the document.  This is
useful when performing deletes and the exact documents deleted need to be saved in the transaction.||
+ ||_version||Type long version of an object.||
+ ||_datecreated||Date the object was created.||
+ ||_datemodified||Date the object was last modified.||
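  For illustration only, here is how these default fields might appear on a Lucene Document; the values are invented and the actual code in TransactionSystem and OceanDatabase that adds them is not shown on this page.

{{{
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class DefaultFieldsExample {
  public static Document example() {
    Document doc = new Document();
    doc.add(new Field("_id", "1024", Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.add(new Field("_documentid", "204800", Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.add(new Field("_version", "3", Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.add(new Field("_datecreated", "2008-09-02T23:04:29Z", Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.add(new Field("_datemodified", "2008-09-02T23:04:29Z", Field.Store.YES, Field.Index.UN_TOKENIZED));
    return doc;
  }
}
}}}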
  
  = Tag Index =
  
@@ -91, +126 @@

  
  = Name Service =
  
- Name services can become quite complex.  For example it may be possible in the future to
use Zookeeper which is a lock based service.  However even by Zookeeper's own admission these
types of lock services are hard to implement and use correctly.  I think for Ocean it should
be good enough in the first release to have an open source SQL database that stores the nodes
and the cells the nodes belong to.  Because there is no master there is no need for a locking
service.  The columns in the node table would be id, status (online/offline), cellid, datecreated,
datemodified.  The cell table would simply be id, status, datecreated, datemodified.  Redundant
name services may be created by replicating these 2 tables.  I am also pondering an errors
table where clients may report outages of a node.  If there are enough outages of a particular
node the name service marks the node as offline.  Clients will be able to listen for events
on a name service related to cells, mainly the node
  status column.  This way if a node that was online goes offline, the client will know about
it and not send requests to it any longer.  
+ Name services can become quite complex.  For example it may be possible in the future to use [http://hadoop.apache.org/zookeeper/ Zookeeper] which is a lock-based service.  However, even by Zookeeper's own admission these types of lock services are hard to implement and use correctly.  I think for Ocean it should be good enough in the first release to have an open source SQL database that stores the nodes and the cells the nodes belong to.  Because there is no master there is no need for a locking service.  The columns in the node table would be id, status (online/offline), cellid, datecreated, datemodified.  The cell table would simply be id, status, datecreated, datemodified.  Redundant name services may be created by replicating these two tables.  I am also pondering an errors table where clients may report outages of a node.  If there are enough outages of a particular node the name service marks the node as offline.  Clients will be able to listen for events on a name service related to cells, mainly the node status column.  This way if a node that was online goes offline, the client will know about it and not send requests to it any longer.  
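  A hypothetical sketch of the node table lookup using plain JDBC against whatever open source SQL database backs the name service.  The table and column names follow the text above; everything else, including the method itself, is invented.

{{{
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

public class NameServiceSketch {
  /** Returns the ids of the online nodes in the given cell. */
  public static List<Long> onlineNodes(Connection conn, long cellId) throws Exception {
    List<Long> nodeIds = new ArrayList<Long>();
    PreparedStatement ps = conn.prepareStatement(
        "SELECT id FROM node WHERE status = 'online' AND cellid = ?");
    ps.setLong(1, cellId);
    ResultSet rs = ps.executeQuery();
    while (rs.next()) {
      nodeIds.add(Long.valueOf(rs.getLong("id")));
    }
    rs.close();
    ps.close();
    return nodeIds;
  }
}
}}}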
  
  = Location Based Services =
  
  [http://sourceforge.net/projects/locallucene/ LocalLucene] provides the functionality for location based queries.  It is possible to optimize how LocalLucene works, and I had code that implemented LocalLucene's functionality directly in Ocean which I may put back in at some point.  The optimization works by implementing a subclass of ScoreDoc that has a Distance object as a member variable.  This removes the need for the DistanceFilter's map from document to distance value.  I would like to see DistanceFilter use the new Lucene Filter code that returns DocIdSet.  
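  A sketch of the described optimization: a ScoreDoc subclass that carries the distance directly, removing the need for a document-to-distance map.  The Distance type here stands in for whatever LocalLucene uses and is not defined on this page.

{{{
import org.apache.lucene.search.ScoreDoc;

public class DistanceScoreDoc extends ScoreDoc {
  public final Distance distance;     // computed once when the hit is collected

  public DistanceScoreDoc(int doc, float score, Distance distance) {
    super(doc, score);
    this.distance = distance;
  }

  /** Placeholder for LocalLucene's distance value. */
  public static class Distance {
    public final double miles;
    public Distance(double miles) { this.miles = miles; }
  }
}
}}}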
+ 
+ = Configuration Options =
+ 
+ ||Name|| Description||
+ ||serverNumber||The server number differentiates servers in a cell and is encoded into the
uniquely generated ids on a server as the first 2 digits of the id.||
+ ||memoryIndexMaxDocs||The maximum number of documents the InstantiatedIndex holds.  Internally
known as the WriteableMemoryIndex.||
+ ||maybeMergeDocChanges||Number of changes to the overall index before the system checks
to see if any indexes need to be merged.  Executing this on every transaction would be a waste.||
+ ||maxRamIndexesSize||Size in bytes of all RAM indexes combined, after which they are written to disk as a single optimized index.||
+ ||maxSnapshots||The maximum number of snapshots the system keeps around.  The system will
only remove the snapshot if it is unlocked.||
+ ||mergeDiskDeletedPercent||Disk indexes that have too many deleted documents need to be
merged to remove the deleted documents.  This is the percentage of deleted documents a DiskIndex
needs to have in order to be considered for merging.||
+ ||snapshotExpiration||Duration in milliseconds after which a snapshot is considered for
removal.||
+ ||deletesFlushThresholdPercent||The percentage of deleted docs after which the deleted docs
file is written for an index/segment.||
+ ||maybeMergesTimerInterval||The maybe-merges check is sometimes started in the background after a transaction; this value in milliseconds also runs it periodically on a background timer.||
+ ||logFileDeleteTimerInterval||Timer value in milliseconds for checking on deleting old log
files.||
+ ||diskIndexRAMDirectoryBufferSize||The buffer size in bytes to be used for the RAMDirectory when multiple DiskIndexes are merged.  Ocean merges the DiskIndexes into a RAMDirectory first, then flushes the RAMDirectory to disk.  This is done to perform a large sequential write, which is much faster than the incremental write process usually used in Lucene merges.||
  
  = To Do =
  
