lucene-java-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Lucene-java Wiki] Update of "OceanRealtimeSearch" by JasonRutherglen
Date Thu, 10 Jul 2008 21:05:22 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-java Wiki" for change notification.

The following page has been changed by JasonRutherglen:
http://wiki.apache.org/lucene-java/OceanRealtimeSearch

New page:
= Introduction =

Ocean is a realtime search implementation written in Java on top of Lucene.  It is currently
available as a patch at http://issues.apache.org/jira/browse/LUCENE-1313

= How it Works =

Ocean writes updates to a transaction log and an in-memory index.  A transaction consists
of document adds and deletes.  If a transaction contains no more than a configurable number
of documents (default: 100), the documents are serialized.  If it contains more than that
number, the documents are encoded into a Lucene segment that is written to the log.  The
latter avoids analyzing the documents again when the transaction log is being replicated.
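
The choice between the two encodings can be sketched as follows.  This is only an illustration
of the rule described above, not code from the patch: the class, enum, and method names are
hypothetical, and only the default of 100 comes from the text.

```java
/** Hypothetical sketch: picks how a transaction's documents are written to the log. */
final class TransactionEncodingSketch {
    /** The two encodings described in the text. */
    enum Encoding { SERIALIZED_DOCUMENTS, LUCENE_SEGMENT }

    /** Default threshold taken from the text (100 documents). */
    static final int DEFAULT_MAX_SERIALIZED_DOCS = 100;

    /**
     * Returns SERIALIZED_DOCUMENTS when the transaction holds at most
     * maxSerializedDocs documents, and LUCENE_SEGMENT otherwise.
     */
    static Encoding choose(int documentCount, int maxSerializedDocs) {
        if (documentCount < 0) {
            throw new IllegalArgumentException("documentCount must not be negative");
        }
        return documentCount <= maxSerializedDocs
                ? Encoding.SERIALIZED_DOCUMENTS
                : Encoding.LUCENE_SEGMENT;
    }
}
```

With the default threshold, a transaction of exactly 100 documents is serialized, and one of
101 documents is encoded as a segment.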

The in-memory index is actually a series of indexes that are periodically merged in memory.
When documents are first added, they are placed into a WriteableMemoryIndex that uses the
Lucene contrib project [http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/store/instantiated/InstantiatedIndex.html
InstantiatedIndex].  InstantiatedIndex provides an in-memory index in which all of the objects
are stored as is, which makes inserts fast because there is no serialization to bytes
as there is with a RAMDirectory.  Once the InstantiatedIndex reaches a predefined limit, it is turned
into a RamIndex.  The RamIndex uses a RAMDirectory and is an intermediate step before the
in-memory index is written to disk.  The DiskIndex uses the normal FSDirectory.  
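
The progression through the three index types can be pictured as a small state machine.  The
sketch below is an assumption-laden illustration: the names are hypothetical, and the text does
not say what the "predefined limit" measures, so a document count is assumed here.

```java
/** Hypothetical sketch of the index progression described in the text. */
final class IndexTierSketch {
    /** The three index types, in the order an index moves through them. */
    enum Tier { WRITEABLE_MEMORY_INDEX, RAM_INDEX, DISK_INDEX }

    /** Returns the tier that follows the given one; DISK_INDEX is the final tier. */
    static Tier next(Tier current) {
        switch (current) {
            case WRITEABLE_MEMORY_INDEX:
                return Tier.RAM_INDEX;   // InstantiatedIndex-backed to RAMDirectory-backed
            case RAM_INDEX:
                return Tier.DISK_INDEX;  // RAMDirectory-backed to FSDirectory-backed
            default:
                return Tier.DISK_INDEX;
        }
    }

    /**
     * Assumption: the limit is a document count. The text only says
     * "a predefined limit" and does not give its unit.
     */
    static boolean reachedLimit(int documentCount, int maxDocuments) {
        return documentCount >= maxDocuments;
    }
}
```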

Ocean writes indexes to disk differently from the usual process.  Instead of merging on
disk, meaning reading from indexes on disk and writing to the new index at the same time,
the merge process occurs in RAM.  A RamIndex is already in RAM and is simply
written to disk.  When multiple DiskIndexes are merged, the new index is first created in
RAM using RAMDirectory and then copied to disk.  The reason for creating the index first in
RAM is to avoid rapid hard drive head movement.  DiskIndexes are usually partially in the
system file cache, so the normal merging process is fast for reads but slow for its
incremental writes.  Hard drives are optimized for large sequential writes, which is
what Ocean performs by first creating the index in RAM.  
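
The general idea of building the result in memory and then writing it out in one pass can be
sketched with plain java.io and java.nio.  This is not the Lucene or Ocean merge code: byte
arrays stand in for indexes, concatenation stands in for merging, and every name below is
hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

/** Hypothetical sketch: assemble the merged result in RAM, then write it once. */
final class RamFirstMergeSketch {
    /**
     * Concatenates the source byte arrays in memory (a stand-in for merging),
     * then writes the whole result to the target file with a single call.
     * Returns the number of bytes written.
     */
    static long mergeInRamThenWrite(List<byte[]> sources, Path target) {
        ByteArrayOutputStream merged = new ByteArrayOutputStream();
        for (byte[] source : sources) {
            merged.write(source, 0, source.length); // in-memory only, no disk access yet
        }
        byte[] bytes = merged.toByteArray();
        try {
            Files.write(target, bytes); // one write of the complete result
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.length;
    }
}
```

The disk sees only one large write of the finished result, rather than many small writes
interleaved with reads of the source indexes.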

Every transaction internally is recognized as a snapshot.  A snapshot (org.apache.lucene.ocean.Snapshot)
consists of a series of IndexSnapshots (org.apache.lucene.ocean.Index.IndexSnapshot).  The
parent class of DiskIndex and RamIndex is DirectoryIndex.  DirectoryIndex uses IndexReader.clone
http://issues.apache.org/jira/browse/LUCENE-1314 in the creation of an IndexSnapshot.  IndexReader.clone
creates a copy of an IndexReader that can be modified without altering the original IndexReader
like IndexReader.reopen does.  DirectoryIndexSnapshots never have documents added to them
as they are single-segment optimized indexes.  Documents are only ever deleted from DirectoryIndexSnapshots.
A transaction with deletes does not result in an IndexReader.flush call because that
call is expensive.  Instead, because the transaction is already stored on disk in the transaction
log, the deletes are applied only to SegmentReader.deletedDocs.  
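
The snapshot behaviour, where a delete produces a new view without changing the earlier one,
can be illustrated with a copy-on-write set of deleted document ids.  This is a minimal sketch
and does not use the Lucene API: the class below is hypothetical, and a java.util.BitSet stands
in for SegmentReader.deletedDocs.

```java
import java.util.BitSet;

/** Hypothetical sketch: an immutable view of one index at one transaction. */
final class IndexSnapshotSketch {
    private final long snapshotId;
    private final int maxDoc;
    private final BitSet deletedDocs; // stand-in for SegmentReader.deletedDocs

    private IndexSnapshotSketch(long snapshotId, int maxDoc, BitSet deletedDocs) {
        this.snapshotId = snapshotId;
        this.maxDoc = maxDoc;
        this.deletedDocs = deletedDocs;
    }

    /** Creates the first snapshot of an index with maxDoc documents and no deletes. */
    static IndexSnapshotSketch create(long snapshotId, int maxDoc) {
        return new IndexSnapshotSketch(snapshotId, maxDoc, new BitSet(maxDoc));
    }

    /** Returns a new snapshot with the extra deletes; this snapshot is left unchanged. */
    IndexSnapshotSketch withDeletes(long newSnapshotId, int... docIds) {
        BitSet copy = (BitSet) deletedDocs.clone(); // the copy is modified, not the original
        for (int docId : docIds) {
            if (docId < 0 || docId >= maxDoc) {
                throw new IllegalArgumentException("docId out of range: " + docId);
            }
            copy.set(docId); // in-memory only; nothing is flushed to the index files
        }
        return new IndexSnapshotSketch(newSnapshotId, maxDoc, copy);
    }

    long snapshotId() { return snapshotId; }

    boolean isDeleted(int docId) { return deletedDocs.get(docId); }

    int numDocs() { return maxDoc - deletedDocs.cardinality(); }
}
```

A search running against an earlier snapshot keeps seeing the documents that a later
transaction deleted, because each snapshot owns its own copy of the deleted set.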

= Transaction Log =

Each transaction is recorded in the transaction log, which is a series of files with the file
name format log00000001.bin.  The suffix number is incremented and a new log file is created when the current
log file reaches a predefined size limit.  The class org.apache.lucene.ocean.log.LogFileManager
is responsible for this process.  
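
The naming and rollover rule can be sketched as below.  This is not LogFileManager itself: the
class and method names are hypothetical, the eight-digit zero padding is inferred from the
example name log00000001.bin, and the size limit is a parameter because the text does not give
its value.

```java
import java.util.Locale;

/** Hypothetical sketch of log file naming and rollover. */
final class LogFileNamingSketch {
    /** Formats a name such as log00000001.bin (width inferred from the example). */
    static String fileName(long sequence) {
        return String.format(Locale.ROOT, "log%08d.bin", sequence);
    }

    /**
     * Returns the sequence number to write to next: the following number once
     * the current file has reached the size limit, otherwise the current one.
     */
    static long nextSequence(long currentSequence, long currentSizeBytes, long maxSizeBytes) {
        return currentSizeBytes >= maxSizeBytes ? currentSequence + 1 : currentSequence;
    }
}
```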

The transaction record consists of three separate parts: the header, the document bytes, and the other
bytes.  The other bytes can store anything other than the documents, usually the serialized deletes.
Each part has a CRC32 check, which verifies the integrity of the data.  The transaction log can become
corrupted if the process is stopped in the middle of a write.  There is a CRC32 check for each
part because the parts are loaded separately at different times.  
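
A per-part check of this kind can be sketched with java.util.zip.CRC32 from the JDK.  The class
and method names are hypothetical, and the sketch deliberately says nothing about how the three
parts are laid out in the log file, since the text does not describe that.

```java
import java.util.zip.CRC32;

/** Hypothetical sketch: one CRC32 value per record part. */
final class RecordPartChecksumSketch {
    /** Computes the CRC32 of a single part (header, document bytes, or other bytes). */
    static long checksum(byte[] part) {
        CRC32 crc = new CRC32();
        crc.update(part, 0, part.length);
        return crc.getValue();
    }

    /** Returns true when the part still matches the checksum stored with it. */
    static boolean isIntact(byte[] part, long storedChecksum) {
        return checksum(part) == storedChecksum;
    }
}
```

Because each part carries its own checksum, a part can be verified when it is loaded without
reading the other two.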
