lucene-java-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Lucene-java Wiki] Update of "OceanRealtimeSearch" by JasonRutherglen
Date Thu, 28 Aug 2008 18:18:43 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-java Wiki" for change notification.

The following page has been changed by JasonRutherglen:
http://wiki.apache.org/lucene-java/OceanRealtimeSearch

------------------------------------------------------------------------------
  
  Because the Ocean system stores the entire document on an update, and there is no support for updating specific fields as in SQL, it is much easier to rectify transactions between nodes: deletes and updates of objects are less likely to clobber each other during the rectification process.  
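As an illustration only (the ObjectVersion class, transactionId field, and rectify method below are hypothetical names, not Ocean's API), a whole-object, last-writer-wins merge keyed on object id might look like this:

{{{
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch only: because an update always carries the whole object,
// two nodes can rectify by keeping the newest whole version per object id;
// there are no per-field merges to resolve.  Deletes are just tombstone versions.
class RectifySketch {

  static class ObjectVersion {
    final long objectId;       // unique object id
    final long transactionId;  // assumed monotonically increasing per update
    final boolean deleted;     // true if this version is a delete tombstone
    ObjectVersion(long objectId, long transactionId, boolean deleted) {
      this.objectId = objectId;
      this.transactionId = transactionId;
      this.deleted = deleted;
    }
  }

  // Fold one node's updates into the merged view, keeping the highest transaction id.
  static void apply(Map<Long, ObjectVersion> merged, Iterable<ObjectVersion> updates) {
    for (ObjectVersion v : updates) {
      ObjectVersion current = merged.get(v.objectId);
      if (current == null || v.transactionId > current.transactionId) {
        merged.put(v.objectId, v);
      }
    }
  }

  static Map<Long, ObjectVersion> rectify(Iterable<ObjectVersion> nodeA,
                                          Iterable<ObjectVersion> nodeB) {
    Map<Long, ObjectVersion> merged = new HashMap<Long, ObjectVersion>();
    apply(merged, nodeA);
    apply(merged, nodeB);
    return merged;  // older whole objects are simply dropped; no field-level conflicts remain
  }
}
}}}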
  
+ = Crowding =
+ 
+ GBase offers a feature called [http://code.google.com/apis/base/attrs-queries.html#crowding crowding] that is somewhat interesting.  It is similar to what Solr refers to as [https://issues.apache.org/jira/browse/SOLR-236 Field Collapsing], though the implementation for Ocean could be a little easier and more efficient.  Solr's Field Collapsing code performs a sort on the results first and then appears to perform another query.  GBase allows only 2 fields to be crowded.  It would seem easier to simply obtain more results than are needed and crowd on a field, similar to how the NutchBean has a dedupField.  I have tried to implement this feature in Ocean but have been unable to get it quite right.
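As a rough sketch of the obtain-more-than-needed approach (the Hit type parameter and ValueExtractor interface are hypothetical stand-ins, not Ocean or Nutch code), crowding can be done in a single pass over an over-fetched, already ranked hit list:

{{{
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; the Hit type parameter and ValueExtractor interface
// are hypothetical stand-ins for whatever Ocean returns for a search result.
class CrowdingSketch {

  // Pulls the crowd field's value out of a hit.
  interface ValueExtractor<Hit> {
    String valueOf(Hit hit);
  }

  // Crowd an over-fetched, already ranked hit list so that at most maxPerValue
  // hits share the same value of the crowd field, stopping once 'wanted' hits remain.
  static <Hit> List<Hit> crowd(List<Hit> overFetched, ValueExtractor<Hit> crowdField,
                               int maxPerValue, int wanted) {
    Map<String, Integer> counts = new HashMap<String, Integer>();
    List<Hit> crowded = new ArrayList<Hit>(wanted);
    for (Hit hit : overFetched) {
      String value = crowdField.valueOf(hit);
      Integer seen = counts.get(value);
      int n = (seen == null) ? 0 : seen.intValue();
      if (n < maxPerValue) {
        counts.put(value, n + 1);
        crowded.add(hit);
        if (crowded.size() == wanted) {
          break;  // enough results after crowding
        }
      }
    }
    return crowded;
  }
}
}}}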
+ 
  = Facets =
  
- I wanted facets to work in realtime because it seemed like a challenging thing to do.  The
way I came up with to do this is a copy on read versioned LRU cache.  The bit sets for faceting
need to be cached.  The problem is, each transaction may perform deletes and the bit set needs
to reflect this.  Rather than perform deletes on all of the cached bit sets for each transaction
(which would consume a large amount of RAM and create a lot of garbage) a copy on read is
used.  The bit set cache stores the deletes docs of each snapshot/transaction.  If a given
bit set is required and the value is out of date then the deletes are applied to a new one.
 Each value in the cache stores multiple versions of a bit set.  Periodically as snapshots
are released by the system the older bit sets are also released.  This system is efficient
because only the used bit sets are brought up to date with the latest snapshot.
+ I wanted facets to work in realtime because it seemed like a challenging thing to do.  The way I came up with to do this is a copy-on-read versioned LRU cache embodied in the BitSetLRUMap.  The bit sets for faceting need to be cached.  The problem is that each transaction may perform deletes, and the bit set needs to reflect this to be accurate during an intersection call.  Rather than perform deletes on all of the cached bit sets for each transaction (which would consume a large amount of RAM and create a lot of garbage), a copy on read is used (deletes are applied only when the value is read).  The bit set cache stores the deleted docs of each snapshot/transaction.  If a given bit set is required and the value is out of date, then the deletes are applied to a new copy.  Each value in the cache stores multiple versions of a bit set.  Periodically, as snapshots are released by the system, the older bit sets are also released.  This system is efficient because only the used bit sets are brought up to date with the latest snapshot.
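A condensed sketch of that mechanism follows; it only shows the copy-on-read and versioning mechanics, so the LRU eviction is left out and the class and method names are illustrative rather than the actual BitSetLRUMap API.

{{{
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Condensed sketch of the copy-on-read idea; LRU eviction is omitted and the
// names are illustrative, not the actual BitSetLRUMap API.
class VersionedBitSetCacheSketch {

  // deleted docs recorded per snapshot/transaction version
  private final TreeMap<Long, BitSet> deletesByVersion = new TreeMap<Long, BitSet>();

  // per cache key: the versions of the bit set materialized so far
  private final Map<String, TreeMap<Long, BitSet>> valuesByKey =
      new HashMap<String, TreeMap<Long, BitSet>>();

  // Record the deletes performed by a snapshot/transaction.
  synchronized void addSnapshot(long version, BitSet deletedDocs) {
    deletesByVersion.put(version, deletedDocs);
  }

  // Seed the cache with a bit set computed as of some snapshot version.
  synchronized void put(String key, long version, BitSet bits) {
    TreeMap<Long, BitSet> versions = valuesByKey.get(key);
    if (versions == null) {
      versions = new TreeMap<Long, BitSet>();
      valuesByKey.put(key, versions);
    }
    versions.put(version, bits);
  }

  // Copy on read: if the newest cached copy is older than the requested
  // snapshot, clone it, apply the intervening deletes, and cache the result.
  synchronized BitSet get(String key, long version) {
    TreeMap<Long, BitSet> versions = valuesByKey.get(key);
    if (versions == null) {
      return null;
    }
    Map.Entry<Long, BitSet> newest = versions.floorEntry(version);
    if (newest == null) {
      return null;  // nothing old enough to derive from
    }
    if (newest.getKey() == version) {
      return newest.getValue();  // already up to date for this snapshot
    }
    BitSet updated = (BitSet) newest.getValue().clone();
    for (BitSet deletes : deletesByVersion.subMap(newest.getKey() + 1, version + 1).values()) {
      updated.andNot(deletes);  // clear the docs deleted since the cached copy
    }
    versions.put(version, updated);
    return updated;
  }

  // As snapshots are released, drop bit set versions no live snapshot can
  // request any more, then trim the delete log below the oldest retained copy.
  synchronized void release(long oldestLiveVersion) {
    long minRetained = oldestLiveVersion;
    for (TreeMap<Long, BitSet> versions : valuesByKey.values()) {
      while (versions.size() > 1 && versions.higherKey(versions.firstKey()) <= oldestLiveVersion) {
        versions.pollFirstEntry();
      }
      if (!versions.isEmpty() && versions.firstKey() < minRetained) {
        minRetained = versions.firstKey();
      }
    }
    deletesByVersion.headMap(minRetained + 1).clear();
  }
}
}}}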
  
  Facet caching needs to be handled per segment and merged when search results are merged.
 
  
  = Storing the Data =
  
  SOLR uses a schema.  I chose not to use a schema because the realtime index should be able to change at any time.  Instead, the raw Lucene field classes such as Store, TermVector, and Indexed are exposed in the OceanObject class.  An analyzer is defined on a per-field, per-OceanObject basis.  Using serialization, this process is not slow and is not bulky over the network, as serialization references redundant objects rather than repeating them.  GData allows the user to store multiple types for a single field.  For example, a field named battingaverage may contain values of type long and text.  I am really not sure how Google handles this underneath.  I decided to use Solr's NumberUtils class, which encodes numbers into sortable strings.  This allows range queries and other enumerations of the field to return the values in their true order rather than string order.  One method I came up with to handle potentially different types in a field is to prepend a letter signifying the type of the value for untokenized fields.  For a string the value would be "s0.323" and for a long "l845445".  This way, when sorting or enumerating over the values, they stay disparate and can be converted back to their true value when the call returns.  Perhaps there is a better method.
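A minimal sketch of that prefix scheme, assuming the long2sortableStr and double2sortableStr methods of Solr's org.apache.solr.util.NumberUtils; the class below and the 'd' prefix for doubles are illustrative, not Ocean's code:

{{{
import org.apache.solr.util.NumberUtils;  // Solr's sortable-string number encoding (assumed API)

// Illustrative sketch of the type-letter prefix; not Ocean's actual code.
class TypedValueEncoderSketch {

  // 'l' marks a long; the sortable encoding keeps numeric order under string comparison.
  static String encodeLong(long value) {
    return "l" + NumberUtils.long2sortableStr(value);
  }

  // 'd' marks a double.
  static String encodeDouble(double value) {
    return "d" + NumberUtils.double2sortableStr(value);
  }

  // Plain strings keep their natural string order.
  static String encodeString(String value) {
    return "s" + value;
  }

  // The leading letter says how to turn a stored term back into its true value.
  static char typeOf(String encodedValue) {
    return encodedValue.charAt(0);
  }
}
}}}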
+ 
+ Since writing the above, I came up with an alternative mechanism to handle any number in a Field.Index.UN_TOKENIZED field.  If the field string can be parsed into a double, then the number is encoded using SOLR's NumberUtils into an encoded double string.  There may be edge cases I am unaware of that make this system not work, but for right now it looks like it will work.  In order to properly process a query, terms whose field is Field.Index.UN_TOKENIZED need to be checked for a numeric value by attempting to parse them.  If the value can be parsed into a double, it is encoded into a double string and replaces the original term text.  A similar process will be used for date strings, which will conform to the ISO 8601 standard.  
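The query-side half of that idea might look roughly like the sketch below; the untokenizedFields set and the rewrite entry point are hypothetical, and the ISO 8601 date handling is left out:

{{{
import java.util.Set;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.TermQuery;
import org.apache.solr.util.NumberUtils;  // assumed, as above

// Rough sketch of the query-side rewrite; not Ocean's actual code.
class NumericTermRewriteSketch {

  private final Set<String> untokenizedFields;  // fields indexed as Field.Index.UN_TOKENIZED

  NumericTermRewriteSketch(Set<String> untokenizedFields) {
    this.untokenizedFields = untokenizedFields;
  }

  TermQuery rewrite(TermQuery query) {
    Term term = query.getTerm();
    if (!untokenizedFields.contains(term.field())) {
      return query;  // only untokenized fields get the numeric treatment
    }
    try {
      // if the term text parses as a double, substitute the sortable encoded
      // form so the query matches what was written to the index
      double value = Double.parseDouble(term.text());
      return new TermQuery(new Term(term.field(), NumberUtils.double2sortableStr(value)));
    } catch (NumberFormatException e) {
      return query;  // not a number: leave the term untouched
    }
  }
}
}}}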
  
  = Tag Index =
  
