lucene-java-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Lucene-java Wiki] Update of "OceanRealtimeSearch" by JasonRutherglen
Date Tue, 26 Aug 2008 13:21:09 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-java Wiki" for change notification.

The following page has been changed by JasonRutherglen:
http://wiki.apache.org/lucene-java/OceanRealtimeSearch

------------------------------------------------------------------------------
  = Introduction =
  
- Ocean enables realtime search written in Java using Lucene.  It is currently in patch phase
at http://issues.apache.org/jira/browse/LUCENE-1313
+ Ocean enables realtime search, written in Java using Lucene.  It is currently in the patch phase at [http://issues.apache.org/jira/browse/LUCENE-1313 LUCENE-1313].  Ocean offers a way for Lucene based applications to take advantage of realtime search.  Realtime search makes a search system behave more like a database, which is probably why Google calls its system [http://code.google.com/apis/gdata/ GData].  GData is offered as an online service rather than as software.  Ocean addresses this by providing the same functionality as GData, open sourced for use in any project.  GData does not provide facets; this is something Ocean can provide in the future.  [http://code.google.com/apis/base/ GBase], a cousin of GData, offers location based search, which Ocean can also offer in the future.  By open sourcing realtime search, more functionality may be built in over time by the community, something GData as an online service cannot do.  Google does not offer realtime search in its search appliance, and I am unaware of other search vendors offering realtime search.
  
  = Background =
  
- From a discussion with Karl Wettin.
+ From a discussion with Karl Wettin:
  
- The one thing GData had over Solr was realtime updates or the ability to add, delete, or
update a document and be able to see the update in search results immediately.  With Solr
the company had decided on a 10 minute interval of updating the index with delta updates from
an Oracle database.  I wanted to see if it was possible with Lucene to create an approximation
of what GData does.  The result is Ocean.
+ I was an early user of Solr when GData came out.  They were similar in that both were search exposed as XML; GData, however, offered realtime search while Solr offered batch processing.  I worked for a social networking company that wanted updates available as fast as possible.  It was hard to achieve anything below a couple of minutes because the queries the company wanted used a sort, and in Lucene a sort loads the field cache into RAM, which on a large index is expensive.  There are ways to solve this, but they were not available at the time.  In any case, I wanted to figure out a way to make updates searchable in as little time as possible while also offering Solr-like functionality such as replication and facets.  The one thing GData had over Solr was realtime updates: the ability to add, delete, or update a document and see the update in search results immediately.  With Solr the company had decided on a 10 minute interval of updating the index with delta updates from an Oracle database.  I wanted to see if it was possible with Lucene to create an approximation of what GData does.  The result is Ocean.
  
  The use case it was designed for is websites with dynamic data, such as social networking sites, photo sites, discussion boards, blogs, and wikis.  More broadly, Ocean can be used with any application that requires the database-like feature of immediate updates.  Probably the best example is that all of Google's web applications, outside of web search, use a GData interface, meaning the primary datastore is not MySQL or some equivalent but a proprietary search based database.  The best example of this is Gmail.  If I receive an email through Gmail I can also search on it immediately; there is no 10 minute delay.  Also in Gmail I can change labels, a common example being changing unread emails to read in bulk.  Presumably Gmail is not reindexing the entire email for each label change.
  
  Most highly trafficked web applications do not use relational facilities like joins because they are too expensive.  Lucene does not offer joins, so this is fine.  The only area where Lucene is currently weak is range queries.  MySQL uses a B-tree index, whereas Lucene uses the time consuming TermEnum and TermDocs combination.  This is an area the Tag Index addresses.
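  
  As a rough sketch (not from the Ocean patch), the following shows why such range scans are slow, using the Lucene 2.x TermEnum and TermDocs API, roughly what RangeFilter does internally; directory, lowerBound, and upperBound are assumed to be defined elsewhere:
  
{{{
import java.util.BitSet;

import org.apache.lucene.index.*;

// Enumerate every term in the range and walk its postings.  This linear
// scan over terms is what makes range queries costly next to a B-tree.
IndexReader reader = IndexReader.open(directory);
BitSet bits = new BitSet(reader.maxDoc());
TermEnum terms = reader.terms(new Term("price", lowerBound));
TermDocs docs = reader.termDocs();
try {
  do {
    Term term = terms.term();
    if (term == null || !"price".equals(term.field())
        || term.text().compareTo(upperBound) > 0) {
      break;  // past the upper end of the range
    }
    docs.seek(term);
    while (docs.next()) {
      bits.set(docs.doc());  // collect every matching document
    }
  } while (terms.next());
} finally {
  terms.close();
  docs.close();
}
}}}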

  
  The way Ocean is designed, there should be no limitations to using it compared to using the Lucene IndexWriter; it offers the same functionality.  If one does not want to use the transaction log Ocean offers, because one simply wants to index 1 million documents at once, Ocean offers what is called a LargeBatch.  It is a way to perform a large number of updates that takes advantage of the new IndexWriter speedup, combined with transactional semantics.
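  
  Purely as an illustration, a hypothetical LargeBatch usage might look like the following; the class and method names here are invented, not Ocean's actual API (see the LUCENE-1313 patch for that):
  
{{{
// Hypothetical sketch only; names are invented for illustration.
LargeBatch batch = ocean.startLargeBatch();  // bypasses the transaction log
try {
  for (Document doc : docs) {
    batch.addDocument(doc);  // takes IndexWriter's fast indexing path
  }
  batch.commit();            // all of the updates become visible at once
} catch (IOException e) {
  batch.rollback();          // none of the updates become visible
}
}}}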
+ 
+ = What I Learned =
+ 
+ Merging is expensive and detrimental to realtime search.  The more merging that occurs during the update call, the longer it takes for the update to become available.  Using IndexWriter.addDocument, committing, and then calling IndexReader.reopen takes time because a merge must occur during the addDocument call.  I learned that I needed to design a system that would not perform merging in the foreground during the update call, and would instead perform the merging in a background thread.  Karl Wettin had created InstantiatedIndex, and it took some time to figure out that it was the right object to use to create an in-memory representation of a document that would be immediately searchable.  The issue of losing data is solved using the tried and true method that MySQL uses: a binary transaction log.  In MySQL the log records the queries; in Lucene it records the documents.
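+ 
+ As a minimal sketch, assuming the contrib InstantiatedIndex API of that era, with analyzer and doc defined elsewhere, the in-memory step looks something like this:
+ 
{{{
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.instantiated.*;

// Sketch: make a document searchable immediately by writing it into an
// in-memory InstantiatedIndex instead of waiting on an IndexWriter merge.
InstantiatedIndex ramIndex = new InstantiatedIndex();
InstantiatedIndexWriter writer = ramIndex.indexWriterFactory(analyzer, true);
writer.addDocument(doc);  // no segment merge in the foreground
writer.commit();
// A reader over the in-memory index sees the document right away.
IndexSearcher searcher = new IndexSearcher(ramIndex.indexReaderFactory());
}}}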
+ 
+ Lucene uses a snapshot system that is embodied in the IndexReader class.  Each IndexReader is a snapshot of the index with associated files.  Ocean also uses an IndexReader per snapshot; however, the IndexReaders are created more often, which means they are also disposed of much more quickly than in a system like Solr.  A lot of design work went into creating a system that allows the IndexReaders to be created and then removed when they are no longer required.  A referencing system was created where Java code may lock a snapshot, do work, and unlock it.  Only a set number of snapshots need to be available at a given time, and the older unlocked snapshots are removed.
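+ 
+ A sketch of the referencing idea; the class here is illustrative, not Ocean's actual code:
+ 
{{{
import java.util.concurrent.atomic.AtomicInteger;

import org.apache.lucene.index.IndexReader;

// Each snapshot wraps an IndexReader with a reference count.  Callers
// lock it, search against the reader, and unlock it; a snapshot is only
// eligible for removal once no caller holds a lock on it.
class Snapshot {
  final IndexReader reader;
  private final AtomicInteger refCount = new AtomicInteger();

  Snapshot(IndexReader reader) { this.reader = reader; }

  void lock()      { refCount.incrementAndGet(); }
  void unlock()    { refCount.decrementAndGet(); }
  boolean unused() { return refCount.get() == 0; }
}
}}}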
  
  = How it Works =
  
@@ -32, +38 @@

  
  The transaction record consists of three separate parts: the header, the document bytes, and the other bytes.  The other bytes can store anything other than the documents, usually the serialized deletes.  Each part has a CRC32 check which ensures the integrity of the data.  The transaction log can become corrupted if the process is stopped in the middle of a write.  There is a CRC32 check with each part because the parts are loaded separately at different times.
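  
  A sketch of the per-part checksumming; the exact record layout in the patch may differ:
  
{{{
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.zip.CRC32;

// Each part of a transaction record carries its own CRC32 so that it can
// be verified independently when it is loaded later.
void writePart(DataOutputStream out, byte[] part) throws IOException {
  CRC32 crc = new CRC32();
  crc.update(part);
  out.writeLong(crc.getValue());  // checksum precedes the payload
  out.writeInt(part.length);
  out.write(part);
}

// One record is then three checksummed parts:
//   writePart(out, header);
//   writePart(out, documentBytes);
//   writePart(out, otherBytes);  // e.g. the serialized deletes
}}}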
  
+ = Replication =
+ 
+ There are two ways to do replication, and I have been leaning towards a non master slave architecture.  I looked at the Paxos at War algorithm for master slave failover.  The problem is, I did not understand it and found it too complex to implement.  I tried other, simpler ways of implementing master slave failover and they still had major problems.  This led me to look for another solution.
+ 
+ Perhaps the best way to implement replication is to simply let the client handle the updates to the nodes.  The client generates a globally unique object id and calls the remote update method concurrently on the nodes.  In a master slave architecture the update is submitted to the master first and then to the slaves, which is not performed in parallel.  If there is an error, the update call may be revoked across the nodes.  If this fails, there is a process on each node to rectify transactions that are inconsistent with those of other nodes.  This is more like how biology works, I believe; the master slave architecture seems somewhat barbaric in its connotations.
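+ 
+ A sketch of the client-driven update; OceanNode and its update and revoke methods are invented for illustration, and nodes, executor, and document are assumed to be defined elsewhere:
+ 
{{{
import java.util.*;
import java.util.concurrent.*;

// The client generates a globally unique id and calls update on every
// node in parallel.  If any node fails, the update is revoked everywhere;
// a per-node rectification process handles whatever remains inconsistent.
final String objectId = UUID.randomUUID().toString();
List<Future<Boolean>> results = new ArrayList<Future<Boolean>>();
for (final OceanNode node : nodes) {
  results.add(executor.submit(new Callable<Boolean>() {
    public Boolean call() throws Exception {
      return node.update(objectId, document);  // remote update method
    }
  }));
}
boolean ok = true;
for (Future<Boolean> result : results) {
  try { ok &= result.get(); } catch (Exception e) { ok = false; }
}
if (!ok) {
  for (OceanNode node : nodes) {
    node.revoke(objectId);  // undo the update everywhere
  }
}
}}}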
+ 
+ Because the Ocean system requires the entire document on an update, and there is no support for updating a specific set of fields as in SQL, it is much easier to rectify transactions between nodes, meaning that deletes and updates of objects are less likely to clobber each other during the rectification process.
+ 
+ = Facets =
+ 
+ I wanted facets to work in realtime because it seemed like a challenging thing to do.  The way I came up with to do this is a copy on read versioned LRU cache.  The bit sets for faceting need to be cached.  The problem is, each transaction may perform deletes, and the bit sets need to reflect this.  Rather than performing the deletes on all of the cached bit sets for each transaction (which would consume a large amount of RAM and create a lot of garbage), a copy on read is used.  The bit set cache stores the deleted docs of each snapshot/transaction.  If a given bit set is required and its value is out of date, the deletes are applied to a new copy.  Each value in the cache stores multiple versions of a bit set.  Periodically, as snapshots are released by the system, the older bit sets are also released.  This system is efficient because only the bit sets actually in use are brought up to date with the latest snapshot.
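+ 
+ A sketch of the copy on read idea; the class and names are illustrative, and seeding of the initial version is omitted:
+ 
{{{
import java.util.BitSet;
import java.util.TreeMap;

// Each cache value holds several versions of one facet bit set, keyed by
// snapshot version.  Reading at a newer version clones the latest copy
// and applies only the deletes accumulated since that copy was made.
class VersionedBitSet {
  private final TreeMap<Long, BitSet> versions = new TreeMap<Long, BitSet>();

  synchronized BitSet get(long snapshotVersion, BitSet deletesSince) {
    BitSet current = versions.get(snapshotVersion);
    if (current == null) {
      current = (BitSet) versions.lastEntry().getValue().clone();
      current.andNot(deletesSince);  // copy on read: apply pending deletes
      versions.put(snapshotVersion, current);
    }
    return current;
  }

  synchronized void release(long oldestLiveVersion) {
    versions.headMap(oldestLiveVersion).clear();  // drop unused versions
  }
}
}}}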
+ 
+ = Tag Index =
+ 
+ The tag index patch is located at [https://issues.apache.org/jira/browse/LUCENE-1292 LUCENE-1292].  I had seen people mention using a ParallelReader to make a static index and a dynamic index appear as one index.  The challenge with this type of system is keeping the doc numbers aligned.  Google seems to have a realtime tag index system, so I figured there must be some way using the Lucene architecture to achieve the same thing.  The method I came up with is to divide the postings list into blocks.  Each block contains a set number of documents; the blocks are divided not by actual byte size but by document number.  The blocks are unified using a TagMultiTermDocs class.  When a block is changed it is written to RAM.  Once the RAM usage hits a certain size, the disk and memory postings are merged to disk.  There needs to be coordination between this process and the merging of the segments.  Each Tag Index is associated with a segment.  In Ocean the merging of segments is performed by the Ocean code and not IndexWriter, so the coordination does not involve hooking into IndexWriter.  Currently there needs to be a way to obtain the doc id from an addDocument call on IndexWriter.  This patch has not been created yet.
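+ 
+ A sketch of the block division; the block size and helper names are invented:
+ 
{{{
import java.util.HashMap;
import java.util.Map;

// Postings are split into fixed ranges of document numbers.  An update
// rewrites only the affected block into RAM, which shadows the on-disk
// copy until the RAM and disk blocks are merged in the background.
final int DOCS_PER_BLOCK = 1024;
Map<Integer, int[]> ramBlocks = new HashMap<Integer, int[]>();

int[] postingsForBlock(int blockId) {
  int[] ram = ramBlocks.get(blockId);
  return ram != null ? ram : readBlockFromDisk(blockId);  // hypothetical read
}

void tagDocument(int docId, int[] updatedBlock) {
  ramBlocks.put(docId / DOCS_PER_BLOCK, updatedBlock);
  if (ramUsage() > threshold) {   // hypothetical accounting
    mergeRamBlocksToDisk();       // coordinated with Ocean's segment merging
  }
}
}}}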
+ 
