lucene-java-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Lucene-java Wiki] Update of "OceanRealtimeSearch" by JasonRutherglen
Date Sun, 31 Aug 2008 23:35:17 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-java Wiki" for change notification.

The following page has been changed by JasonRutherglen:
http://wiki.apache.org/lucene-java/OceanRealtimeSearch

------------------------------------------------------------------------------
  
  Ocean enables realtime search written in Java using Lucene.  It is currently in patch phase at [http://issues.apache.org/jira/browse/LUCENE-1313 LUCENE-1313].  Ocean offers a way for Lucene based applications to take advantage of realtime search.  Realtime search makes search systems more like a database.  This is probably why Google calls its system [http://code.google.com/apis/gdata/ GData].  GData is offered as an online service and not software.  Ocean addresses this by providing the same functionality as GData, open sourced for use in any project.  GData does not provide facets; this is something that Ocean can provide in the future.  [http://code.google.com/apis/base/ GBase], a cousin of GData, offers location based search.  Ocean offers location based search using [http://sourceforge.net/projects/locallucene/ LocalLucene].  By open sourcing realtime search, more functionality may be built in over time by the community, which is something GData, being an online service, cannot do.  Google does not offer realtime search in its search appliance.  I am unaware of other search vendors offering realtime search.
  
+ There is a good [http://acmqueue.com/modules.php?name=Content&pa=showpage&pid=337
article] written by Adam Bosworth who seems to have headed up the GData project at Google.
 I think many of his points are quite valid.  It is worth mentioning the main points of the
article here as they also define the positive attributes of the Ocean open source search system.
  
   * It is worth making things simple enough that one can harness Moore’s law in parallel
   * It is acceptable to be stale much of the time
@@ -17, +17 @@

  
  From a discussion with Karl Wettin:
  
+ I was an early user of [http://lucene.apache.org/solr/ Solr] when GData came out.  They were similar in that both exposed search as XML.  GData however offered realtime search, while Solr offered batch processing.  I worked for a social networking company that wanted updates available as fast as possible.  It was hard to achieve anything below a couple of minutes because the queries the company wanted used a sort.  In Lucene a sort loads the field cache into RAM, which on a large index is expensive.  There are ways to solve this, but they were not available.  In any case I wanted to figure out a way to allow updates to become searchable in as little time as possible while also offering functionality like Solr's replication and facets.  The one thing GData had over Solr was realtime updates: the ability to add, delete, or update a document and see the update in search results immediately.  With Solr the company had settled on a 10 minute interval of updating the index with delta updates from an Oracle database.  I wanted to see if it was possible with Lucene to create an approximation of what GData does.  The result is Ocean.
  
  The use case it was designed for is websites with dynamic data, such as social networks, photo sites, discussion boards, blogs, and wikis.  More broadly it is possible to use Ocean with any application that requires the database-like feature of immediate updates.  Probably the best example of this is that all of Google's web applications, outside of web search, use a GData interface.  Meaning the primary datastore is not MySQL or some equivalent; it is a proprietary search based database.  The best example of this is Gmail.  If I receive an email through Gmail I can also search on it immediately; there is no 10 minute delay.  Also in Gmail I can change labels, a common example being marking unread emails as read in bulk.  Presumably Gmail is not reindexing the entire email for each label change.
  
@@ -47, +47 @@

  
  Each transaction is recorded in the transaction log, which is a series of files with the file name format log00000001.bin.  The suffix number is incremented and a new log file is created when the current log file reaches a predefined size limit.  The class org.apache.lucene.ocean.log.LogFileManager is responsible for this process.
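  The rolling naming scheme above can be sketched as follows.  This is an illustration only, not Ocean's actual LogFileManager code; the class name, the size limit value, and the method names are assumptions, while the log00000001.bin format matches the text.

```java
// Hypothetical sketch of the rolling transaction-log naming described above.
// Only the "log00000001.bin" naming format comes from the text; the size
// limit and method names are illustrative assumptions.
public class LogFileNaming {
    // Predefined size limit after which a new log file is started (value assumed).
    static final long SIZE_LIMIT = 64L * 1024 * 1024;

    // Produces names like log00000001.bin for a given suffix number.
    static String fileNameFor(int suffix) {
        return String.format("log%08d.bin", suffix);
    }

    // When the current file reaches the limit, roll over to the next suffix.
    static int nextSuffix(long currentFileSize, int currentSuffix) {
        return currentFileSize >= SIZE_LIMIT ? currentSuffix + 1 : currentSuffix;
    }
}
```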
  
+ The transaction record consists of three separate parts: the header, document bytes, and other bytes.  The other bytes can store anything other than the documents, usually the serialized deletes.  Each part has a CRC32 check which ensures the integrity of the data.  The transaction log can become corrupted if the process is stopped in the middle of a write.  There is a CRC32 check with each part because the parts are loaded separately at different times.  For example, during the recovery process on Ocean server startup the documents are loaded first and in-memory indexes are created.  Then the deletes from the transactions are executed.  Then the indexes are optimized to remove the deleted documents.  The process described is much faster than performing each transaction incrementally during recovery.  It is important to note that internally each delete, especially a delete by query, is saved as the actual document ids that were deleted when the transaction was committed.  If the system simply re-executed the delete by query, the transaction could create inconsistent results.
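  The per-part integrity check can be sketched with java.util.zip.CRC32.  The class and method names below are hypothetical; only the mechanism, a CRC32 stored alongside each part and re-checked when that part is loaded, follows the description above.

```java
import java.util.zip.CRC32;

// Minimal sketch of the per-part CRC32 check described above.  Names are
// hypothetical; only the CRC32-per-part mechanism matches the text.
public class RecordPartCheck {
    // Computes the CRC32 checksum stored alongside a record part.
    static long checksum(byte[] part) {
        CRC32 crc = new CRC32();
        crc.update(part, 0, part.length);
        return crc.getValue();
    }

    // On load, a part is accepted only if its stored checksum still matches;
    // a mismatch means the write was interrupted, so recovery discards the record.
    static boolean isIntact(byte[] part, long storedCrc) {
        return checksum(part) == storedCrc;
    }
}
```

  Because the header, document bytes, and other bytes each carry their own checksum, recovery can validate the document bytes without ever reading the other parts.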
  
  = Replication =
  
@@ -83, +83 @@

  
  = Distributed Search =
  
+ Distributed search with Ocean will use the http://issues.apache.org/jira/browse/LUCENE-1336 patch.  It provides RMI functionality over the Hadoop IPC protocol.  Using Hadoop IPC as a transport has advantages over using Sun's RMI because it is simpler and uses [http://java.sun.com/j2se/1.4.2/docs/guide/nio/ NIO] (non blocking sockets).  In large systems using NIO reduces thread usage and allows the overall system to scale better.  LUCENE-1336 allows classes to be dynamically loaded by the server from the client on a per client basis to avoid problems with classloaders and class versions.  For me, using a remote method invocation system is much faster for implementing functionality than using Solr and writing XML interfaces and clients or using namedlists.  I prefer writing distributed code using Java objects because they are what I am more comfortable with.  Also I worked on Jini at Sun, and one might say it is in the blood.  The idea to create a better technique for classloading comes from my experiences and failures trying to implement Jini systems.  Search is a fairly straightforward, non-changing problem, so dynamic classloading is only required by the server from the client.  By having a reduced scope problem, the solution was much easier to produce than with Jini, which attempted to solve all potential problems even if they most likely do not exist.
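  The Java-objects-over-the-wire style argued for above might look like the following.  Every name here is hypothetical; LUCENE-1336 defines its own interfaces, and the stub below stands in for a real remote proxy only to show the shape of the calls compared to assembling XML requests.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a remote-invocation style search interface, the
// kind of call shape discussed above.  All names are illustrative; the
// actual LUCENE-1336 patch defines its own interfaces over Hadoop IPC.
interface RemoteSearchable {
    // The query and results travel as serialized Java objects rather than XML.
    List<String> search(Serializable query, int maxHits);
}

// Trivial in-process stand-in for a remote proxy, for illustration only.
class InMemorySearchable implements RemoteSearchable {
    private final List<String> docs = new ArrayList<String>();

    void add(String doc) { docs.add(doc); }

    // Substring match standing in for a real Lucene query.
    public List<String> search(Serializable query, int maxHits) {
        List<String> hits = new ArrayList<String>();
        for (String d : docs) {
            if (hits.size() < maxHits && d.contains(query.toString())) {
                hits.add(d);
            }
        }
        return hits;
    }
}
```

  The client codes against the interface alone; whether the implementation is local, or a proxy speaking Hadoop IPC, is invisible at the call site, which is the convenience being claimed over hand-built XML clients.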
  
  In the future it is possible to write a servlet wrapper around the Ocean Java client and expose the Ocean functionality as XML, possibly conforming to [http://www.opensearch.org OpenSearch] and/or GData.
  
