incubator-blur-commits mailing list archives

From Apache Wiki <>
Subject [Blur Wiki] Update of "InternalDataStructureDevelopment" by AaronMcCurry
Date Thu, 08 Nov 2012 15:16:45 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Blur Wiki" for change notification.

The "InternalDataStructureDevelopment" page has been changed by AaronMcCurry:

New page:
== Internal Data Structure Development ==

Any data structure development in Blur needs to have a manageable memory footprint.  The easiest
way to achieve this is to make the data structure file based (through the Lucene Directory
API).  Implementing a file based data structure allows it to make use of the block cache
directory, which automatically caches blocks of the files in use.

If a data structure cannot be written to use the file API then a manageable memory model will
have to be implemented.
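As a sketch of the caching idea (this is not Blur's actual block cache directory; the class, block size, and capacity below are all illustrative assumptions), a file-backed structure can be read through a small LRU block cache so that only the hot blocks of a file occupy heap:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative LRU block cache: holds fixed-size blocks of a file so that
// only recently used blocks of a file-based data structure occupy heap.
// Block size and capacity are made-up values, not Blur's actual settings.
public class BlockCache {
    static final int BLOCK_SIZE = 8 * 1024;  // 8 KB blocks (assumption)
    private final int maxBlocks;
    private final Map<Long, byte[]> cache;

    public BlockCache(int maxBlocks) {
        this.maxBlocks = maxBlocks;
        // access-order LinkedHashMap evicts the least recently used block
        this.cache = new LinkedHashMap<Long, byte[]>(16, 0.75f, true) {
            protected boolean removeEldestEntry(Map.Entry<Long, byte[]> e) {
                return size() > BlockCache.this.maxBlocks;
            }
        };
    }

    public interface BlockLoader {
        byte[] load(long blockId);
    }

    public byte[] getBlock(long blockId, BlockLoader loader) {
        byte[] block = cache.get(blockId);
        if (block == null) {
            block = loader.load(blockId);  // cache miss: read from the file
            cache.put(blockId, block);
        }
        return block;
    }

    public int cachedBlocks() {
        return cache.size();
    }
}
```

The point of the sketch is that a cache miss only costs a file read, whereas a fully in-heap structure has no such fallback.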

The reasoning for this development strategy is threefold:

 * JVM Heap limitations
 * Data growth issues
 * User query requirements

=== JVM Heap limitations ===

The two main limitations are garbage collection and overall size.  The overall size of the
heap is currently limited in practice to around 16 GB (assuming that you are NOT using the
Zing JVM or an Azul appliance).  There are many, many blogs discussing the limitations of
the GC and the JVM heap.

=== Data growth issues ===

A goal for most clusters is to have enough RAM to hold the ''hot'' portions of the index
in memory.  However, in some situations it may be necessary to load more data into a system
than is recommended.  This will cause the caching system to miss more often, but the system
will continue to operate.  If the same situation occurs with a naive, fully heap-loaded data
structure, the cluster could fail with the usual "Out Of Memory" exceptions.

=== User query requirements ===

==== Sorting ====

User queries are for the most part short lived and require minimal amounts of heap space;
the big exception is sorting.  These queries require the ordering field(s) to be loaded into
memory.  Many improvements have been made in Lucene 4 when it comes to field caching, but the
default implementation loads the entire field contents into the heap.  In addition to the
on-heap version, Lucene offers a separate implementation that reads the field contents from
files (via the Directory API); this is the implementation that Blur should use to perform
sorting.

NOTE: These features in Lucene 4.0 are called Column Stride Fields (also known as DocValues).
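A minimal sketch of the file-based idea (plain java.io here, not Lucene's actual DocValues reader; the class and file layout are illustrative assumptions): fixed-width sort values are written once and read per document from disk, instead of the whole column being held on heap.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative file-backed sort column: one fixed-width long per document,
// read from disk on demand rather than loaded entirely into the heap.
// This mimics the idea behind Lucene 4's disk-based field values; it is
// not Lucene's actual implementation.
public class FileBackedSortColumn implements AutoCloseable {
    private final RandomAccessFile file;

    public FileBackedSortColumn(Path path) throws IOException {
        this.file = new RandomAccessFile(path.toFile(), "rw");
    }

    public void setValue(int docId, long sortValue) throws IOException {
        file.seek((long) docId * Long.BYTES);  // fixed-width slot per doc
        file.writeLong(sortValue);
    }

    public long getValue(int docId) throws IOException {
        file.seek((long) docId * Long.BYTES);
        return file.readLong();
    }

    public void close() throws IOException {
        file.close();
    }
}
```

Because the slots are fixed width, the comparator used during sorting can seek directly to any document's value, and the operating system's page cache (or a block cache) keeps the hot portions in memory.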

==== Filtering ====

The next largest memory consumer for user queries is filter caching.  For the most part this
is accomplished through weakly referenced bit sets that represent the filter the user
requested.  A file based solution has not yet been implemented, but should be.
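A sketch of the weakly referenced approach (the class and method names are illustrative, not Blur's actual code): each cached filter is a bit set with one bit per matching document, held through a weak reference so the garbage collector can reclaim it under memory pressure instead of the process failing.

```java
import java.lang.ref.WeakReference;
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Illustrative filter cache: bit sets (one bit per document matching the
// filter) are held via weak references, so the GC may reclaim them rather
// than the JVM failing with OutOfMemoryError. Not Blur's actual class.
public class FilterCache {
    private final Map<String, WeakReference<BitSet>> cache = new HashMap<>();

    public BitSet getFilter(String filterQuery, Supplier<BitSet> compute) {
        WeakReference<BitSet> ref = cache.get(filterQuery);
        BitSet bits = (ref == null) ? null : ref.get();
        if (bits == null) {           // never cached, or already collected
            bits = compute.get();     // re-run the filter query
            cache.put(filterQuery, new WeakReference<>(bits));
        }
        return bits;
    }
}
```

The trade-off is that a collected filter must be recomputed on its next use; a file-based version would avoid that recomputation at the cost of a disk read.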
