lucene-java-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Lucene-java Wiki] Update of "OceanRealtimeSearch" by JasonRutherglen
Date Wed, 27 Aug 2008 13:39:32 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-java Wiki" for change notification.

The following page has been changed by JasonRutherglen:
http://wiki.apache.org/lucene-java/OceanRealtimeSearch

------------------------------------------------------------------------------
  = Introduction =
  
+ Ocean enables realtime search written in Java using Lucene.  It is currently in the patch phase at [http://issues.apache.org/jira/browse/LUCENE-1313 LUCENE-1313].  Ocean offers a way for Lucene based applications to take advantage of realtime search.  Realtime search makes search systems more like a database, which is probably why Google calls its system [http://code.google.com/apis/gdata/ GData].  GData is offered as an online service and not as software.  Ocean addresses this by providing the same functionality as GData, open sourced for use in any project.  GData does not provide facets, something Ocean can provide in the future.  [http://code.google.com/apis/base/ GBase], a cousin of GData, offers location based search; Ocean offers location based search using [http://sourceforge.net/projects/locallucene/ LocalLucene].  By open sourcing realtime search, more functionality may be built in over time by the community, something GData, being an online service, cannot do.  Google does not offer realtime search in its search appliance, and I am unaware of other search vendors offering realtime search.
  
  = Background =
  
@@ -10, +10 @@

  
  I was an early user of Solr when GData came out.  They were similar in that they were both search exposed as XML.  GData, however, offered realtime search while Solr offered batch processing.  I worked for a social networking company that wanted updates available as fast as possible.  It was hard to achieve anything below a couple of minutes because the queries the company wanted used a sort, and in Lucene a sort loads the field cache into RAM, which on a large index is expensive.  There are ways to solve this but they were not available.  In any case I wanted to figure out a way to allow updates to become searchable in as little time as possible while also offering functionality Solr provides such as replication and facets.  The one thing GData had over Solr was realtime updates, or the ability to add, delete, or update a document and see the update in search results immediately.  With Solr the company had decided on a 10 minute interval of updating the index with delta updates from an Oracle database.  I wanted to see if it was possible with Lucene to create an approximation of what GData does.  The result is Ocean.
  
+ The use case it was designed for is websites with dynamic data, such as social networking sites, photo sites, discussion boards, blogs, and wikis.  More broadly, it is possible to use Ocean with any application that requires the database-like feature of immediate updates.  Probably the best example of this is that all of Google's web applications, outside of web search, use a GData interface.  Meaning the primary datastore is not MySQL or some equivalent; it is a proprietary search based database.  The best example of this is Gmail.  If I receive an email through Gmail I can also search on it immediately; there is no 10 minute delay.  Also in Gmail I can change labels, a common example being changing unread emails to read in bulk.  Presumably Gmail is not reindexing the entire email for each label change.
  
  Most highly trafficked web applications do not use relational facilities like joins because they are too expensive.  Lucene does not offer joins so this is fine.  The only area Lucene is currently weak in is range queries.  MySQL uses a B-tree index whereas Lucene uses the time-consuming TermEnum and TermDocs combination.  This is an area the Tag Index addresses.
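
For illustration, a minimal sketch of a range scan written against the Lucene 2.x API of the era; this is not Ocean or Lucene source code, and the field name and bounds are whatever the caller passes in:
{{{
import java.util.BitSet;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;

// Rough sketch of a range scan: every term in [lower, upper] for the given
// field is enumerated and its postings walked, which is the expensive part
// compared to a btree range lookup.
public class RangeScan {
  public static BitSet docsInRange(IndexReader reader, String field,
                                   String lower, String upper) throws Exception {
    BitSet bits = new BitSet(reader.maxDoc());
    TermEnum terms = reader.terms(new Term(field, lower));
    TermDocs termDocs = reader.termDocs();
    try {
      do {
        Term term = terms.term();
        if (term == null || !term.field().equals(field)
            || term.text().compareTo(upper) > 0) {
          break;                      // walked past the end of the range
        }
        termDocs.seek(terms);         // scan the postings for this term
        while (termDocs.next()) {
          bits.set(termDocs.doc());
        }
      } while (terms.next());
    } finally {
      terms.close();
      termDocs.close();
    }
    return bits;
  }
}
}}}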

  
@@ -18, +18 @@

  
  = What I Learned =
  
+ Merging is expensive and detrimental to realtime search.  The more merging that occurs during the update call, the longer it takes for the update to become available.  Using IndexWriter.addDocument, committing, and then calling IndexReader.reopen takes time because a merge must occur during the commit call, which would be made after each transaction.  I learned that I needed to design a system that would not perform merging in the foreground during the update call, and instead perform the merging in a background thread.  Karl Wettin had created InstantiatedIndex, and it took some time to figure out that it was the right object to use to create an in memory index of documents that would be immediately searchable.  The issue of losing data is solved by the standard method MySQL uses, which is a binary transaction log of the serialized documents and deletes.
  
  Lucene uses a snapshot system that is embodied in the IndexReader class.  Each IndexReader is a snapshot of the index with associated files.  Ocean uses an IndexReader per snapshot; however, the IndexReaders are created more often.  This means the IndexReaders are also disposed of much more quickly than in a system like Solr.  A lot of design work went into creating a system that allows the IndexReaders to be created and then removed when they are no longer required.  A referencing system was created for each snapshot where Java code may lock a snapshot, do work, and unlock it.  Only a set number of snapshots needs to be available at a given time and the older unlocked snapshots are removed.  Deletes occur in RAM directly to the BitVector of the IndexReader with no flush to disk.
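
A minimal sketch of that kind of referencing scheme; the class and method names here are hypothetical and are not the actual org.apache.lucene.ocean classes:
{{{
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical snapshot reference counting: searchers lock a snapshot, do
// their work, and unlock it.  A superseded snapshot is only closed (readers
// released, files deletable) once the last holder unlocks it.
class SnapshotRef {
  private final AtomicInteger refCount = new AtomicInteger(0);
  private volatile boolean released = false;

  void lock()   { refCount.incrementAndGet(); }

  void unlock() {
    if (refCount.decrementAndGet() == 0 && released) {
      close();                 // last user is gone and a newer snapshot exists
    }
  }

  // Called by the system once newer snapshots make this one unnecessary.
  void release() {
    released = true;
    if (refCount.get() == 0) {
      close();
    }
  }

  private void close() {
    // close this snapshot's IndexReaders and remove files that are unused
  }
}
}}}
A caller would typically wrap its search work in lock()/unlock() inside a try/finally block.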
  
  = How it Works =
  
+ Ocean writes updates to a transaction log and an in memory index.  A transaction consists of document adds and deletes.  If a transaction consists of (default: 100) or fewer documents, the documents are serialized to the transaction log.  If there are more than that, the documents are encoded into a Lucene segment by using an IndexWriter to write the documents to a RAMDirectory, which is then serialized to the transaction log.  The latter reduces redundant analyzing if the transaction log is being replicated.
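
A rough sketch of that branching follows; the threshold comes from the text, but the Transaction type and the helper methods are hypothetical placeholders rather than the actual Ocean code:
{{{
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;

// Hypothetical transaction log append: small transactions store the raw
// serialized documents, while large ones are pre-encoded as a Lucene segment
// in a RAMDirectory so replicas do not have to re-analyze the documents.
class TransactionLogWriter {
  private static final int SEGMENT_THRESHOLD = 100;   // default from the text

  void append(Transaction tx) throws Exception {
    if (tx.documents().size() <= SEGMENT_THRESHOLD) {
      writeSerializedDocuments(tx.documents());        // plain serialization
    } else {
      RAMDirectory dir = new RAMDirectory();
      IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
      for (Document doc : tx.documents()) {
        writer.addDocument(doc);                       // analyze once, here
      }
      writer.close();
      writeSerializedDirectory(dir);                   // ship the segment bytes
    }
    writeDeletes(tx.deletes());
  }

  // Placeholders for illustration only.
  interface Transaction { List<Document> documents(); List<String> deletes(); }
  void writeSerializedDocuments(List<Document> docs) { /* ... */ }
  void writeSerializedDirectory(RAMDirectory dir)    { /* ... */ }
  void writeDeletes(List<String> deletes)            { /* ... */ }
}
}}}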
  
+ The in memory index is actually a series of indexes that are periodically merged in memory.  When documents are first added, they are placed into a WriteableMemoryIndex that uses the Lucene contrib project [http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/store/instantiated/InstantiatedIndex.html InstantiatedIndex].  InstantiatedIndex provides an in memory index where all of the objects are stored as-is, which makes for fast insert times because there is no serialization to bytes as with a RAMDirectory.  Once the InstantiatedIndex reaches a predefined limit, it is converted into a RamIndex.  The RamIndex uses a RAMDirectory and is an intermediary step before the in memory index is written to disk.  RamIndexes are periodically merged in the background as well.  The DiskIndex uses the normal FSDirectory.
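
As a purely illustrative outline of the tiers just described (the class names come from the text, but these are stubs and the limit is made up):
{{{
import java.util.ArrayList;
import java.util.List;

// Hypothetical outline of the index tiers.  The real Ocean classes wrap
// InstantiatedIndex, RAMDirectory, and FSDirectory respectively; here they
// are only stubs to show the promotion path.
class IndexTiers {
  static class WriteableMemoryIndex {                 // InstantiatedIndex-backed
    int numDocs() { return 0; }
    RamIndex toRamIndex() { return new RamIndex(); }
  }
  static class RamIndex {}                            // RAMDirectory-backed
  static class DiskIndex {}                           // FSDirectory-backed

  private static final int MEMORY_DOC_LIMIT = 10000;  // illustrative limit only

  private WriteableMemoryIndex memoryIndex = new WriteableMemoryIndex();
  private final List<RamIndex> ramIndexes = new ArrayList<RamIndex>();
  private final List<DiskIndex> diskIndexes = new ArrayList<DiskIndex>();

  synchronized void maybePromote() {
    if (memoryIndex.numDocs() >= MEMORY_DOC_LIMIT) {
      ramIndexes.add(memoryIndex.toRamIndex());       // rewrite into a RAMDirectory
      memoryIndex = new WriteableMemoryIndex();       // start a fresh in-memory tier
    }
    // a background thread periodically merges the RamIndexes and, once they
    // are large enough, writes the merged result to disk as a new DiskIndex
  }
}
}}}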
  
+ Ocean uses a different process than usual for writing indexes to disk.  Instead of merging on disk, meaning reading from indexes on disk and writing to the new index at the same time, the merge process occurs in RAM.  This happens with the RamIndex, where the index is already in RAM and is simply written to disk.  When multiple DiskIndexes are merged, the new index is first created in RAM using a RAMDirectory and then copied to disk.  The reason for creating the index first in RAM is to save on rapid hard drive head movement.  Usually DiskIndexes are partially in the system file cache, so the normal merging process is fast for reads but slow for the incremental write process.  Hard drives are optimized for large sequential writes, which is exactly what Ocean performs by first creating the index in RAM.  The large segment, typically 64MB in size, is written all at once, which should take 5-10 seconds.
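
A hedged sketch of the idea using stock Lucene calls of that era; Ocean's actual merge code differs, and the destination path here is made up:
{{{
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

// Merge several on-disk indexes into a single segment in RAM first, then copy
// the finished result to disk in one large sequential write.
public class RamMerge {
  public static void mergeToDisk(Directory[] diskIndexes, String destPath) throws Exception {
    RAMDirectory ramDir = new RAMDirectory();
    IndexWriter merger = new IndexWriter(ramDir, new StandardAnalyzer(), true);
    merger.addIndexes(diskIndexes);        // reads from the (often file-cached) disk indexes
    merger.optimize();                     // leave a single optimized segment in RAM
    merger.close();

    FSDirectory destDir = FSDirectory.getDirectory(destPath);
    Directory.copy(ramDir, destDir, true); // one sequential copy to disk
    destDir.close();
  }
}
}}}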
  
+ Each transaction internally is recognized as a snapshot.  A snapshot (org.apache.lucene.ocean.Snapshot) consists of a series of IndexSnapshots (org.apache.lucene.ocean.Index.IndexSnapshot).  The parent class of DiskIndex and RamIndex is DirectoryIndex.  DirectoryIndex uses IndexReader.clone ([http://issues.apache.org/jira/browse/LUCENE-1314 LUCENE-1314]) in the creation of an IndexSnapshot.  IndexReader.clone creates a copy of an IndexReader that can be modified without altering the original IndexReader, whereas IndexReader.reopen may return the same underlying reader.  DirectoryIndexSnapshots never have documents added to them as they are single segment optimized indexes; they are only deleted from.  A transaction with deletes does not result in an IndexReader.flush call because this process is expensive.  Instead, because the transaction is already stored on disk in the transaction log, the deletes occur only to the SegmentReader.deletedDocs.
+ 
+ Facets and filters need to be cached per Index.  Each Index is really the same as a Lucene segment.  However, due to the way Lucene is designed, if one is merging outside of IndexWriter then each segment needs to be in its own physical directory.  This creates some extra files such as the SegmentInfos file.  Ocean manages deleting the old index directories when they are no longer necessary.
  
  = Transaction Log =
  
@@ -42, +44 @@

  
  There are two ways to do replication, and I have been leaning towards a non master-slave architecture.  I looked at the Paxos at War algorithm for master-slave failover.  The problem is, I did not understand it and found it too complex to implement.  I tried other, simpler ways of implementing master-slave failover and they still had major problems.  This led me to look for another solution.
  
+ In a master-slave architecture the update is submitted to the master first and then to the slaves, which is not performed in parallel.  Configuration changes, such as turning [http://mysqlha.blogspot.com/2007/05/semi-sync-replication-for-mysql-5037.html semi-sync] on or off, would require restarting all processes in the system.
  
+ Perhaps the best way to implement replication is to simply let the client handle the updates to the nodes.  The client generates a globally unique object id and calls the remote update method concurrently on the nodes.  If there are many nodes this will make the system perform faster than waiting for the master to do work and then the slaves.  It also allows the client to be in control of how long it wants to wait for a transaction to complete and how many nodes it needs for a transaction to be considered successful.  If there is an error, the update call may be revoked across the nodes.  If this fails, there is a process on each node to rectify transactions that are inconsistent with those of other nodes.  This is more like how biology works, I believe.  The master-slave architecture seems somewhat barbaric in its connotations.
+ 
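A rough sketch of the client-driven approach; the OceanNode interface, the update and revoke calls, and the id scheme are hypothetical placeholders, not the actual Ocean API:
{{{
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical client-driven replication: the client creates a globally unique
// object id and submits the same update to every node in parallel, deciding
// for itself how long to wait and how many successes count as a committed
// transaction.  Failed transactions are revoked on a best-effort basis.
public class ReplicatingClient {
  interface OceanNode {                                // placeholder remote interface
    void update(String objectId, byte[] document) throws Exception;
    void revoke(String objectId) throws Exception;
  }

  private final List<OceanNode> nodes;
  private final ExecutorService executor = Executors.newCachedThreadPool();

  public ReplicatingClient(List<OceanNode> nodes) { this.nodes = nodes; }

  public boolean update(final byte[] document, int requiredSuccesses, long timeoutMillis)
      throws InterruptedException {
    final String id = UUID.randomUUID().toString();    // globally unique object id
    List<Future<Boolean>> results = new ArrayList<Future<Boolean>>();
    for (final OceanNode node : nodes) {
      results.add(executor.submit(new Callable<Boolean>() {
        public Boolean call() {
          try { node.update(id, document); return Boolean.TRUE; }
          catch (Exception e) { return Boolean.FALSE; }
        }
      }));
    }
    int successes = 0;
    for (Future<Boolean> result : results) {
      try {
        if (result.get(timeoutMillis, TimeUnit.MILLISECONDS)) successes++;
      } catch (Exception e) { /* timeout or failure counts as no success */ }
    }
    if (successes < requiredSuccesses) {
      for (OceanNode node : nodes) {
        try { node.revoke(id); } catch (Exception ignored) { }   // best-effort rollback
      }
      return false;
    }
    return true;
  }
}
}}}
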
+ Because the Ocean system stores the entire document on an update and there is no support for updating specific fields as in SQL, it is much easier to rectify transactions between nodes, meaning that deletes and updates of objects are less likely to clobber each other during the rectification process.
  
  = Facets =
  
  I wanted facets to work in realtime because it seemed like a challenging thing to do.  The way I came up with to do this is a copy-on-read versioned LRU cache.  The bit sets for faceting need to be cached.  The problem is, each transaction may perform deletes and the bit sets need to reflect this.  Rather than perform deletes on all of the cached bit sets for each transaction (which would consume a large amount of RAM and create a lot of garbage), a copy on read is used.  The bit set cache stores the deleted docs of each snapshot/transaction.  If a given bit set is required and its value is out of date, then the deletes are applied to a new copy.  Each value in the cache stores multiple versions of a bit set.  Periodically, as snapshots are released by the system, the older bit sets are also released.  This system is efficient because only the used bit sets are brought up to date with the latest snapshot.
  
+ Facet caching needs to be handled per segment and merged when search results are merged.
+ 
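A simplified sketch of the copy-on-read idea described above; the types and method names are hypothetical, and the real cache also evicts old versions and entries LRU-style as snapshots are released:
{{{
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical copy-on-read versioned bit set cache.  Each facet/filter key
// maps to bit sets keyed by snapshot version.  On read, if the newest usable
// cached version is older than the requested snapshot, the deletes recorded
// since that version are applied to a copy, which is cached for the new version.
class VersionedBitSetCache {
  private final Map<String, TreeMap<Long, BitSet>> cache =
      new HashMap<String, TreeMap<Long, BitSet>>();
  // deleted doc ids recorded per snapshot version by the update path
  private final TreeMap<Long, int[]> deletesByVersion = new TreeMap<Long, int[]>();

  synchronized void recordDeletes(long version, int[] deletedDocs) {
    deletesByVersion.put(version, deletedDocs);
  }

  synchronized void put(String key, long version, BitSet bits) {
    TreeMap<Long, BitSet> versions = cache.get(key);
    if (versions == null) {
      versions = new TreeMap<Long, BitSet>();
      cache.put(key, versions);
    }
    versions.put(version, bits);
  }

  synchronized BitSet get(String key, long snapshotVersion) {
    TreeMap<Long, BitSet> versions = cache.get(key);
    if (versions == null || versions.isEmpty()) {
      return null;                                     // caller computes and puts
    }
    Map.Entry<Long, BitSet> base = versions.floorEntry(snapshotVersion);
    if (base == null) {
      return null;                                     // nothing old enough cached
    }
    if (base.getKey().longValue() == snapshotVersion) {
      return base.getValue();                          // already up to date
    }
    BitSet copy = (BitSet) base.getValue().clone();    // copy on read
    for (int[] deletes : deletesByVersion.subMap(base.getKey(), false,
                                                 snapshotVersion, true).values()) {
      for (int doc : deletes) {
        copy.clear(doc);                               // apply the later deletes
      }
    }
    versions.put(snapshotVersion, copy);               // cache the new version
    return copy;
  }
}
}}}
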
  = Storing the Data =
  
+ Solr uses a schema.  I chose not to use a schema because the realtime index should be able to change at any time.  Instead the raw Lucene field classes such as Store, TermVector, and Indexed are exposed in the OceanObject class.  An analyzer is defined on a per field, per OceanObject basis.  Using serialization, this process is not slow and is not bulky over the network because serialization performs referencing of redundant objects.  GData allows the user to store multiple types for a single field.  For example, a field named battingaverage may contain values of type long and text.  I am really not sure how Google handles this underneath.  I decided to use Solr's NumberUtils class, which encodes numbers into sortable strings.  This allows range queries and other enumerations of the field to return the values in their true order rather than string order.  One method I came up with to handle potentially different types in a field is to prepend a letter signifying the type of the value for untokenized fields.  For a string the value would be "s0.323" and for a long "l845445".  This way, when sorting or enumerating over the values, they stay disparate and can be converted back to their true value upon return of the call.  Perhaps there is a better method.
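
A small sketch of the encoding described; the prefix letters come from the text, while the use of Solr's NumberUtils (and its long2sortableStr method) is assumed here and the helper class itself is hypothetical:
{{{
import org.apache.solr.util.NumberUtils;

// Hypothetical encoding of mixed-type values for a single field: a prefix
// letter keeps the types disparate in the term space, and longs use a sortable
// string encoding so range queries and enumeration keep numeric order.
public class TypedValueEncoder {
  public static String encodeString(String value) {
    return "s" + value;                                // e.g. "s0.323"
  }

  public static String encodeLong(long value) {
    // NumberUtils.long2sortableStr is assumed; any encoding that preserves
    // numeric order as string order would serve the same purpose.
    return "l" + NumberUtils.long2sortableStr(value);
  }

  // On the way out, the prefix is stripped and the encoding reversed so the
  // caller sees the true value rather than the sortable representation.
  public static boolean isLong(String encoded)   { return encoded.charAt(0) == 'l'; }
  public static boolean isString(String encoded) { return encoded.charAt(0) == 's'; }
}
}}}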
  
  = Tag Index =
  
@@ -60, +66 @@

  
  = Distributed Search =
  
+ Distributed search with Ocean will use the [http://issues.apache.org/jira/browse/LUCENE-1336 LUCENE-1336] patch.  It provides RMI functionality over the Hadoop IPC protocol.  Using Hadoop IPC as a transport has advantages over using Sun's RMI because it is simpler and uses NIO.  In large systems using NIO reduces thread usage and allows the overall system to scale better.  LUCENE-1336 allows classes to be dynamically loaded by the server from the client on a per client basis to avoid problems with classloaders and class versions.  For me, using a remote method invocation system is a much faster way to implement functionality than using Solr and implementing XML interfaces and clients or using namedlists.  I prefer writing distributed code using Java objects because they are what I am most comfortable with.  Also, I worked on Jini at Sun, and one might say it is in the blood.  The idea to create a better technique for classloading comes from my experiences and failures trying to implement Jini systems.  Search is a fairly straightforward, non-changing problem, and so the dynamic classloading is only required by the server from the client.  By having a reduced-scope problem the solution was much easier to produce compared to working with Jini, which attempted to solve all potential problems even if they most likely do not exist.
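
The LUCENE-1336 API itself is not shown on this page; as a purely illustrative placeholder, the remote interface exposed over Hadoop IPC might have roughly this shape:
{{{
import java.io.IOException;

import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.TopDocs;

// Hypothetical remote interface; the method names and parameters are
// placeholders, not the actual LUCENE-1336 or Ocean API.  The point is that
// the client calls plain Java methods while the transport (Hadoop IPC) and the
// per-client dynamic classloading of Query implementations happen underneath.
public interface RemoteSearchable {
  TopDocs search(Query query, Sort sort, int n) throws IOException;
  long update(byte[] serializedObjects) throws IOException;   // returns a transaction id
  void delete(Query query) throws IOException;
}
}}}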
  
  In the future it is possible to write a servlet wrapper around the Ocean Java client and expose the Ocean functionality as XML, possibly conforming to [http://www.opensearch.org OpenSearch] and/or GData.
  
+ = Location Based Services =
+ 
+ [http://sourceforge.net/projects/locallucene/ LocalLucene] provides the functionality for location based queries.  It is possible to optimize how LocalLucene works, and I had code that implemented LocalLucene functionality directly in Ocean that I may put back in at some point.  The optimization works by implementing a subclass of ScoreDoc that has a Distance object as a member variable.  This removes the need for the DistanceFilter to keep a map from document to distance value.  I would like to see DistanceFilter use the new Lucene Filter code that returns a DocIdSet.
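
A sketch of the optimization described; ScoreDoc is the real Lucene class, while the distance member here merely stands in for LocalLucene's Distance object:
{{{
import org.apache.lucene.search.ScoreDoc;

// Hypothetical subclass carrying the distance with the hit itself, so the
// DistanceFilter no longer needs to keep a document-to-distance map.
public class DistanceScoreDoc extends ScoreDoc {
  public final double distanceInMiles;   // stands in for LocalLucene's Distance object

  public DistanceScoreDoc(int doc, float score, double distanceInMiles) {
    super(doc, score);
    this.distanceInMiles = distanceInMiles;
  }
}
}}}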
+ 
+ = To Do =
+ 
+  * Finish and write test cases for OceanDatabase
+  * Facets
+  * Filter caching
+  * Distributed updates
+  * Name service
+  * Rework the code to allow UUID strings as the transaction ids rather than longs for the distributed updates
+  * Test case for LargeBatch
+ 
