lucene-java-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Lucene-java Wiki] Update of "OceanRealtimeSearch" by JasonRutherglen
Date Fri, 29 Aug 2008 11:02:27 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-java Wiki" for change notification.

The following page has been changed by JasonRutherglen:
http://wiki.apache.org/lucene-java/OceanRealtimeSearch

------------------------------------------------------------------------------
  = Introduction =
  
  Ocean enables realtime search, written in Java using Lucene.  It is currently in the patch phase at [http://issues.apache.org/jira/browse/LUCENE-1313 LUCENE-1313].  Ocean offers a way for Lucene-based applications to take advantage of realtime search.  Realtime search makes search systems more like a database.  This is probably why Google calls its system [http://code.google.com/apis/gdata/ GData].  GData is offered as an online service and not software.  Ocean addresses this by providing the same functionality as GData, open sourced for use in any project.  GData does not provide facets; this is something that Ocean can provide in the future.  [http://code.google.com/apis/base/ GBase], a cousin of GData, offers location based search.  Ocean offers location based search using [http://sourceforge.net/projects/locallucene/ LocalLucene].  By open sourcing realtime search, more functionality may be built in over time by the community, which is something GData, being an online service, cannot do.  Google does not offer realtime search in its search appliance.  I am unaware of other search vendors offering realtime search.
+ 
+ There is a good [http://acmqueue.com/modules.php?name=Content&pa=showpage&pid=337 article] written by Adam Bosworth, who seems to have headed up the GData project at Google.  I think many of his points are quite valid.  It is worth mentioning the main points of the article here, as they also define the positive attributes of the Ocean open source search system.
+ 
+  * It is worth making things simple enough that one can harness Moore’s law in parallel
+  * It is acceptable to be stale much of the time
+  * Be as loosely coupled as possible
+  * SQL databases have problems harnessing Moore's law in parallel (whereas distributed search
systems do not)
+  * SQL databases have problems allowing users to evolve a schema over time.  Lucene offers the ability to define any field at any time, including the ability to define multiple values for a field, something SQL databases do not offer.
+  * SQL queries are more complex than search queries which are more suitable for end users
  
  = Background =
  
@@ -20, +29 @@

  
  Merging is expensive and detrimental to realtime search.  The more merging that occurs during the update call, the longer it takes for the update to become available.  Using IndexWriter.addDocument, committing, and then calling IndexReader.reopen takes time because a merge must occur during the commit call, which would be called after each transaction.  I learned that I needed to design a system that would not perform merging in the foreground during the update call, and instead have the merging performed in a background thread.  Karl Wettin had created InstantiatedIndex, and it took some time to figure out that it was the right object to use to create an in-memory index of document(s) that would be immediately searchable.  The issue of losing data is solved by the standard method MySQL uses, which is a binary transaction log of the serialized documents and deletes.
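The transaction log idea above can be sketched in a few lines.  This is a hypothetical illustration, not Ocean's actual code: the class name, record format, and API are assumptions.  Each update is appended to the log before it becomes searchable in the in-memory index, so a crash can be recovered by replaying the log.

```java
// Hypothetical sketch of a binary transaction log: serialized documents
// and deletes are appended as length-prefixed records.  Names and the
// record format are illustrative, not Ocean's actual implementation.
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

class TransactionLog {
    private final DataOutputStream out;

    TransactionLog(File file) throws IOException {
        // Open in append mode: appending is far cheaper than rewriting files.
        out = new DataOutputStream(new FileOutputStream(file, true));
    }

    // Append one length-prefixed record (a serialized document or a delete).
    synchronized void append(byte[] record) throws IOException {
        out.writeInt(record.length);
        out.write(record);
        out.flush();  // flush so the record survives a process crash
    }
}
```

On startup, a recovery step would read the records back in the same length-prefixed format and re-apply them to a fresh in-memory index.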
  
- Lucene uses a snapshot system that is embodied in the IndexReader class.  Each IndexReader
is a snapshot of the index with associated files.  Ocean uses an IndexReader per snapshot
however the IndexReaders are created more often.  This means the IndexReaders are also disposed
of much more quickly than in a system like SOLR.  A lot of design work went into creating
a system that would allow the IndexReaders to be created and then to remove them when they
are no longer required.  A referencing system was created for each snapshot where Java code
may lock a snapshot, do work and unlock it.  Only a set number of snapshots need to be available
at a given time and the older unlocked snapshots are removed.  Deletes occur in ram directly
to the bitvector of the IndexReader with no flush to disk.  
+ Lucene uses a snapshot system that is embodied in the IndexReader class.  Each IndexReader is a snapshot of the index with associated files.  Ocean uses an IndexReader per snapshot; however, the IndexReaders are created more often.  This means the IndexReaders are also disposed of much more quickly than in a system like SOLR.  A lot of design work went into creating a system that would allow the IndexReaders to be created and then to remove them when they are no longer required.  A referencing system was created for each snapshot where Java code may lock a snapshot, do work, and unlock it.  Only a set number of snapshots need to be available at a given time, and the older unlocked snapshots are removed.  Deletes occur in RAM directly to the bitvector of the IndexReader with no flush to disk.  This is because it was found to be prohibitively expensive to flush a new deletes file to disk on each update transaction.  The file would then need to be cleaned only a few transactions later.  Because there is a transaction log there is no need to write the deletes to disk twice, and it is much faster to append to a file than to create a new one and delete it possibly seconds later.
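The snapshot referencing scheme above can be sketched as a small reference-counted handle.  This is a minimal sketch under assumed names, not Ocean's actual API: searchers lock a snapshot before use and unlock it afterwards, and a background cleaner may release an older snapshot only once its count drops to zero.

```java
// Minimal sketch of snapshot reference counting.  In a real system,
// release() would close the snapshot's underlying IndexReader; here it
// only marks the snapshot unusable.  Names are illustrative.
class Snapshot {
    private int refCount = 0;
    private boolean released = false;

    // Returns false if the cleaner already released this snapshot,
    // in which case the caller must fall back to a newer snapshot.
    synchronized boolean lock() {
        if (released) return false;
        refCount++;
        return true;
    }

    synchronized void unlock() {
        refCount--;
    }

    // Called by the cleaner for snapshots that are no longer the newest;
    // succeeds only when no searcher currently holds a lock.
    synchronized boolean tryRelease() {
        if (refCount > 0) return false;
        released = true;
        return true;
    }
}
```

Making the methods `synchronized` closes the check-then-act race between a searcher locking and the cleaner releasing.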
  
  = How it Works =
  
@@ -64, +73 @@

  
  SOLR uses a schema.  I chose not to use a schema because the realtime index should be able to change at any time.  Instead, the raw Lucene field classes such as Store, TermVector, and Indexed are exposed in the OceanObject class.  An analyzer is defined on a per-field, per-OceanObject basis.  Using serialization, this process is not slow and is not bulky over the network, as serialization performs referencing of redundant objects.  GData allows the user to store multiple types for a single field.  For example, a field named battingaverage may contain values of type long and text.  I am really not sure how Google handles this underneath.  I decided to use Solr's NumberUtils class, which encodes numbers into sortable strings.  This allows range queries and other enumerations of the field to return the values in their true order rather than string order.  One method I came up with to handle potentially different types in a field is to prepend a letter signifying the type of the value for untokenized fields.  For a string the value would be "s0.323" and for a long "l845445".  This way, when sorting or enumerating over the values, they stay disparate and can be modified to be their true value upon return of the call.  Perhaps there is a better method.
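The type-prefix idea above can be sketched as an encode/decode pair.  This is an illustration of the scheme as described, with assumed class and method names; note that for range queries to work in true numeric order, the body after the tag letter would need to be the sortable NumberUtils encoding rather than the raw digits shown here.

```java
// Sketch of one-letter type tags on untokenized values, so mixed-type
// values in one field stay disparate when sorted or enumerated and can
// be restored to their true type on the way out.  Names are illustrative.
class TypedValue {
    static String encode(Object value) {
        if (value instanceof Long)   return "l" + value;  // e.g. "l845445"
        if (value instanceof Double) return "d" + value;
        return "s" + value;                               // e.g. "s0.323"
    }

    static Object decode(String encoded) {
        char tag = encoded.charAt(0);
        String body = encoded.substring(1);
        switch (tag) {
            case 'l': return Long.parseLong(body);
            case 'd': return Double.parseDouble(body);
            default:  return body;
        }
    }
}
```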
  
- Since writing the above I came up with an alternative mechanism to handle any number in
an Field.Index.UN_TOKENIZED field.  If the field string can be parsed into a double, then
the number is encoded using SOLR's NumberUtils into an encoded double string.  There may be
edge cases I am unaware of that make this system not work, but for right now it looks like
it will work.  In order to properly process a query, terms with fields in the query that are
Field.Index.UN_TOKENIZED will need to be checked for having a number by attempted parsing.
 If the value can be parsed into a double then it is encoded into a double string and replaced
in the term.  A similar process will be used for date strings which will conform to the ISO
8601 standard.  
+ Since writing the above I came up with an alternative mechanism to handle any number in a Field.Index.UN_TOKENIZED field.  If the field string can be parsed into a double, then the number is encoded using SOLR's NumberUtils into an encoded double string.  There may be edge cases I am unaware of that make this system not work, but for right now it looks like it will work.  In order to properly process a query, terms with fields in the query that are Field.Index.UN_TOKENIZED will need to be checked for a number by attempting to parse them.  If the value can be parsed into a double, then it is encoded into a double string and replaced in the term.  A similar process will be used for date strings, which will conform to the ISO 8601 standard.  The user may also explicitly define the type of a number by naming the field such as "battingaverage_double", where the type is defined by an underscore and then the type name.  This is similar to Solr's dynamic fields construction.
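The two conventions above can be sketched together: a term value is treated as a number if it parses as a double, and a field name such as "battingaverage_double" declares its type by suffix.  This is a hypothetical sketch with assumed names; the actual encoding step would call Solr's NumberUtils.

```java
// Sketch of query-time type detection: values that parse as a double get
// re-encoded (stand-in shown), and a "_type" suffix on the field name
// explicitly declares the type, similar to Solr's dynamic fields.
class FieldTyping {
    // Would the alternative mechanism above treat this value as a number?
    static boolean looksNumeric(String value) {
        try {
            Double.parseDouble(value);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    // Extract an explicit type suffix, e.g. "battingaverage_double" -> "double".
    static String typeFromName(String fieldName) {
        int i = fieldName.lastIndexOf('_');
        return i >= 0 ? fieldName.substring(i + 1) : null;
    }
}
```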
  
  = Tag Index =
  
- The tag index patch is located at [https://issues.apache.org/jira/browse/LUCENE-1292 LUCENE-1292].
 I had seen people mention using a ParallelReader to have an index that is static and an index
that is dynamic appear as one index.  The challenge with this type of system is to get the
doc numbers to stay aligned.  Google seems to have a realtime tag index system.  I figured
there must be some way using the Lucene architecture to achieve the same thing.  The method
I came up with is to divide the postings list into blocks.  Each block contains a set number
of documents, the blocks are not divided by actual byte size but by document number.  The
blocks are unified using a TagMultiTermDocs class.  When a block is changed it is written
to RAM.  Once the RAM usage hits a certain size, the disk and memory postings are merged to
disk.  There needs to be coordination between this process and the merging of the segments.
 Each Tag Index is associated with a segment.  In Ocean the merging of segments is performed by the Ocean code and not IndexWriter so the coordination does
not involve hooking into IndexWriter.  Currently there needs to be a way to obtain the doc
id from an addDocument call from IndexWriter.  This patch has not been created yet.  
+ The tag index patch is located at [https://issues.apache.org/jira/browse/LUCENE-1292 LUCENE-1292].  I had seen people mention using a ParallelReader to make an index that is static and an index that is dynamic appear as one index.  The challenge with this type of system is to get the doc numbers to stay aligned.  Google seems to have a realtime tag index system.  I figured there must be some way using the Lucene architecture to achieve the same thing.  The method I came up with is to divide the postings list into blocks.  Each block contains a set number of documents; the blocks are not divided by actual byte size but by document number.  The blocks are unified using a TagMultiTermDocs class.  When a block is changed it is written to RAM.  Once the RAM usage hits a certain size, the disk and memory postings are merged to disk.  There needs to be coordination between this process and the merging of the segments.  Each Tag Index is associated with a segment.  In Ocean the merging of segments is performed by the Ocean code and not IndexWriter, so the coordination does not involve hooking into IndexWriter.  Currently there needs to be a way to obtain the doc id from an addDocument call from IndexWriter, which still needs a patch.
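The block division above can be sketched as simple doc-number arithmetic.  This is an assumed illustration of the partitioning only (the block size constant and names are hypothetical); the real work is in TagMultiTermDocs unifying the blocks and in merging RAM blocks back to disk.

```java
// Sketch of partitioning tag postings by document number: a change to one
// document only rewrites its own block (to RAM first, merged to disk
// later).  The block size and class name are illustrative assumptions.
class TagBlocks {
    static final int DOCS_PER_BLOCK = 1024;  // assumed block size

    // Which block a document's tag postings live in.
    static int blockFor(int docId) {
        return docId / DOCS_PER_BLOCK;
    }

    // First doc id covered by a block, used when unifying RAM and disk blocks.
    static int blockStart(int blockId) {
        return blockId * DOCS_PER_BLOCK;
    }
}
```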
  
  = Distributed Search =
  
@@ -87, +96 @@

   * Filter caching
   * Distributed updates
   * Name service
-  * Rework the code to allow UUID strings as the transaction ids rather than longs for the
distributed updates
+  * Write the node asynchronous conflict resolution code and test cases
   * Test case for LargeBatch
  
