incubator-blur-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Blur Wiki] Update of "DataStructureDevelopment" by AaronMcCurry
Date Sun, 21 Oct 2012 18:56:58 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Blur Wiki" for change notification.

The "DataStructureDevelopment" page has been changed by AaronMcCurry:
http://wiki.apache.org/blur/DataStructureDevelopment?action=diff&rev1=2&rev2=3

  == Data Structure Development ==
  
- Any data structure development in Blur needs to have a manageable memory footprint.  The
easiest way to achieve this behavior is to make the data structure file based (through the Lucene
Directory API).  Implementing a file based data structure will make use of the BlockCacheDirectory,
which will automatically cache block of the files in use.
+ Any data structure development in Blur needs to have a manageable memory footprint.  The
easiest way to achieve this behavior is to make the data structure file based (through the Lucene
Directory API).  Implementing a file based data structure will make use of the block cache directory,
which will automatically cache blocks of the files in use.
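
The caching idea behind a block cache directory can be sketched in plain Java. The class, block size, and LRU policy below are hypothetical illustrations, not Blur's actual implementation (which sits behind the Lucene Directory API): a file-backed structure reads fixed-size blocks on demand and keeps recently used blocks in a bounded cache, so the heap footprint stays capped no matter how large the file grows.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a block-cached, file-based reader.
// Blur's real block cache lives behind the Lucene Directory API;
// this only illustrates the bounded-memory caching idea.
public class BlockCachedFile {
    private static final int BLOCK_SIZE = 8 * 1024; // assumed block size

    private final RandomAccessFile file;
    private final int maxCachedBlocks;
    private final Map<Long, byte[]> cache;

    public BlockCachedFile(String path, int maxCachedBlocks) throws IOException {
        this.file = new RandomAccessFile(path, "r");
        this.maxCachedBlocks = maxCachedBlocks;
        // An access-ordered LinkedHashMap gives a simple LRU eviction policy,
        // so heap usage is bounded by maxCachedBlocks * BLOCK_SIZE.
        this.cache = new LinkedHashMap<Long, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
                return size() > BlockCachedFile.this.maxCachedBlocks;
            }
        };
    }

    public byte readByte(long position) throws IOException {
        long blockId = position / BLOCK_SIZE;
        byte[] block = cache.get(blockId);
        if (block == null) { // cache miss: read the whole block from disk
            block = new byte[BLOCK_SIZE];
            file.seek(blockId * BLOCK_SIZE);
            file.read(block);
            cache.put(blockId, block);
        }
        return block[(int) (position % BLOCK_SIZE)];
    }

    public int cachedBlocks() {
        return cache.size();
    }
}
```

A reader that touches many positions only ever holds `maxCachedBlocks` blocks on the heap; everything else is re-read from disk on a cache miss.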
  
  If a data structure cannot be written to use the file API then a manageable memory model
will have to be implemented.
  
  The reasoning for this development strategy is threefold:
  
- * JVM Heap limitations
+  * JVM Heap limitations
- * Data grow issues
+  * Data growth issues
- * User query requirements
+  * User query requirements
  
  === JVM Heap limitations ===
  
- There are many, many blogs discussing the limitations of the JVM heap.
+ The two main limitations are garbage collection and overall size.  The overall size of
the heap is currently limited to around 16 GB (assuming that you are NOT using Zing
or an Azul appliance).  There are many, many blogs discussing the limitations of GC and the
JVM heap.
  
  === Data growth issues ===
  
- 
+ A goal for most clusters is to have enough RAM to hold the ''hot'' portions of the index
in memory.  However, in some situations it may be necessary to load more data into a system
than is recommended.  This will cause the caching system to miss more often, but the system
will continue to operate.  If the same situation occurs with a naive, fully in-heap data
structure, the cluster could fail with the usual "Out Of Memory" errors.
  
  === User query requirements ===
  
+ User queries are for the most part short lived and require minimal amounts of heap space;
the big exception is sorting.  Sorted queries require the ordering field(s) to be loaded into
memory.  Many improvements have been made in Lucene 4 when it comes to field caching, but
the default implementation loads the entire field contents into the heap.  In addition to
the on-heap version, Lucene offers a separate implementation that reads the field contents
from files (via the Directory API); this is the implementation that Blur should use to perform
sorting.
+ 
+ NOTE: These features in Lucene 4.0 are called Column Stride Fields.
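
The memory trade-off described above can be sketched in plain Java. The classes and file layout below are hypothetical illustrations, not Lucene's actual FieldCache or Column Stride Fields implementation: an on-heap source materializes every document's sort value into one array, while a file-backed source seeks into a fixed-width values file per lookup, keeping heap cost constant regardless of segment size.

```java
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical sketch contrasting on-heap vs file-backed sort values.
// Lucene 4's real implementations live behind the FieldCache and
// Directory APIs; this only illustrates the memory trade-off.
public class SortValues {

    // On-heap: the whole column is loaded into a long[] (fast lookups,
    // but heap cost grows linearly with the number of documents).
    public static long[] loadOnHeap(String path, int numDocs) throws IOException {
        long[] values = new long[numDocs];
        RandomAccessFile in = new RandomAccessFile(path, "r");
        for (int doc = 0; doc < numDocs; doc++) {
            values[doc] = in.readLong();
        }
        in.close();
        return values;
    }

    // File-backed: each lookup seeks to doc * 8 bytes, so heap usage
    // stays constant no matter how many documents the segment holds.
    public static long readFromFile(RandomAccessFile in, int doc) throws IOException {
        in.seek((long) doc * 8);
        return in.readLong();
    }

    // Helper to write a fixed-width (8 bytes per document) values file.
    public static void writeValues(String path, long[] values) throws IOException {
        DataOutputStream out = new DataOutputStream(new FileOutputStream(path));
        for (long v : values) {
            out.writeLong(v);
        }
        out.close();
    }
}
```

The file-backed lookup pairs naturally with a block cache like the one described earlier, since hot blocks of the values file would stay in memory while cold ones are read on demand.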
+ 
