Return-Path: X-Original-To: apmail-jackrabbit-commits-archive@www.apache.org Delivered-To: apmail-jackrabbit-commits-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id F1BBE10C2E for ; Thu, 24 Oct 2013 10:43:18 +0000 (UTC) Received: (qmail 90785 invoked by uid 500); 24 Oct 2013 10:43:15 -0000 Delivered-To: apmail-jackrabbit-commits-archive@jackrabbit.apache.org Received: (qmail 90619 invoked by uid 500); 24 Oct 2013 10:43:09 -0000 Mailing-List: contact commits-help@jackrabbit.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@jackrabbit.apache.org Delivered-To: mailing list commits@jackrabbit.apache.org Received: (qmail 90577 invoked by uid 99); 24 Oct 2013 10:43:08 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 24 Oct 2013 10:43:08 +0000 X-ASF-Spam-Status: No, hits=-2000.0 required=5.0 tests=ALL_TRUSTED X-Spam-Check-By: apache.org Received: from [140.211.11.4] (HELO eris.apache.org) (140.211.11.4) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 24 Oct 2013 10:42:53 +0000 Received: from eris.apache.org (localhost [127.0.0.1]) by eris.apache.org (Postfix) with ESMTP id 8A1152388BF1; Thu, 24 Oct 2013 10:42:05 +0000 (UTC) Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Subject: svn commit: r1535337 [10/11] - in /jackrabbit/site/live/oak/docs: ./ css/ js/ Date: Thu, 24 Oct 2013 10:42:03 -0000 To: commits@jackrabbit.apache.org From: mduerig@apache.org X-Mailer: svnmailer-1.0.9 Message-Id: <20131024104205.8A1152388BF1@eris.apache.org> X-Virus-Checked: Checked by ClamAV on apache.org Modified: jackrabbit/site/live/oak/docs/query.html URL: http://svn.apache.org/viewvc/jackrabbit/site/live/oak/docs/query.html?rev=1535337&r1=1535336&r2=1535337&view=diff ============================================================================== --- jackrabbit/site/live/oak/docs/query.html (original) +++ jackrabbit/site/live/oak/docs/query.html Thu Oct 24 10:42:02 2013 @@ -1,507 +1,507 @@ - - - - - - - - - Jackrabbit Oak - - - - - - - - - - - - - - - - - - - Fork me on GitHub - - - - - - - - - -
- - - - - -
-
- -
- - -
- -
-

The Query Engine

-

Oak does not index content by default as does Jackrabbit 2. You need to create custom indexes when necessary, much like in traditional RDBMSs. If there is no index for a specific query, then the repository will be traversed. That is, the query will still work but probably be very slow.

-

Query Indices are defined under the oak:index node.

-
-

XPath to SQL2 Transformation

-

To support the XPath query language, such queries are internally converted to SQL2.

-

Every conversion is logged in debug level under the org.apache.jackrabbit.oak.query.QueryEngineImpl logger:

- -
-
org.apache.jackrabbit.oak.query.QueryEngineImpl Parsing xpath statement: 
-    //element(*)[@sling:resourceType = 'slingevent:Lock')]
-org.apache.jackrabbit.oak.query.QueryEngineImpl XPath > SQL2: 
-    select [jcr:path], [jcr:score], * from [nt:base] as a 
-    where [sling:resourceType] = 'slingevent:Lock' 
-    /* xpath: //element(*)[@sling:resourceType = 'slingevent:Lock' 
-    and @lock.created < xs:dateTime('2013-09-02T15:44:05.920+02:00')] */
-
-

Each transformed SQL2 query contains the original XPath query as a comment.

-
-

Query Processing

-

Internally, the query engine uses a cost based query optimizer that asks all the available query indexes for the estimated cost to process the query. It then uses the index with the lowest cost.

-

By default, the following indexes are available:

- -
    - -
  • A property index for each indexed property.
  • - -
  • A full-text index which is based on Apache Lucene / Solr.
  • - -
  • A node type index (which is based on an property index for the properties jcr:primaryType and jcr:mixins).
  • - -
  • A traversal index that iterates over a subtree.
  • -
-

If no index can efficiently process the filter condition, the nodes in the repository are traversed at the given subtree.

-

Usually, data is read from the index and repository while traversing over the query result. There are exceptions however, where all data is read in memory when the query is executed: when using a full-text index, and when using an “order by” clause.

-
-

The Property Index

-

Is useful whenever there is a query with a property constraint that is not full-text:

- -
-
SELECT * FROM [nt:base] WHERE [jcr:uuid] = $id
-
-

To define a property index on a subtree you have to add an index definition node that:

- -
    - -
  • must be of type oak:queryIndexDefinition
  • - -
  • must have the type property set to property
  • - -
  • contains the propertyNames property that indicates what properties will be stored in the index. propertyNames can be a list of properties, and it is optional.in case it is missing, the node name will be used as a property name reference value
  • -
-

Optionally you can specify

- -
    - -
  • a uniqueness constraint on a property index by setting the unique flag to true
  • - -
  • that the property index only applies to a certain node type by setting the declaringNodeTypes property
  • - -
  • the reindex flag which when set to true, triggers a full content re-index.
  • -
-

Example:

- -
-
{
-  NodeBuilder index = root.child("oak:index");
-  index.child("uuid")
-    .setProperty("jcr:primaryType", "oak:queryIndexDefinition", Type.NAME)
-    .setProperty("type", "property")
-    .setProperty("propertyNames", "jcr:uuid")
-    .setProperty("declaringNodeTypes", "mix:referenceable")
-    .setProperty("unique", true)
-    .setProperty("reindex", true);
-}
-
-

or to simplify you can use one of the existing IndexUtils#createIndexDefinition helper methods:

- -
-
{
-  NodeBuilder index = IndexUtils.getOrCreateOakIndex(root);
-  IndexUtils.createIndexDefinition(index, "myProp", true, false, ImmutableList.of("myProp"), null);
-}
-
-
-

The Lucene Full-Text Index

-

The full-text index handles only the ‘contains’ type of queries.

- -
-
//*[jcr:contains(., 'text')]
-
-

Not having a full-text index means that the full-text queries will not be able to work properly. Currently the query engine has a basic verification in place for full-text conditions, but that is brittle and can miss hits.

-

The full-text index update is asynchronous via a background thread, see Oak#withAsyncIndexing. This means that some full-text searches will not work for a small window of time: the background thread runs every 5 seconds, plus the time is takes to run the diff and to run the text-extraction process.

-

The async update status is now reflected on the oak:index node with the help of a few properties, see OAK-980

-

TODO Node aggregation OAK-828

-

The index definition node for a lucene-based full-text index:

- -
    - -
  • must be of type oak:queryIndexDefinition
  • - -
  • must have the type property set to lucene
  • - -
  • must contain the async property set to the value async, this is what sends the index update process to a background thread
  • -
-

Optionally you can add

- -
    - -
  • what subset of property types to be included in the index via the includePropertyTypes property
  • - -
  • a blacklist of property names: what property to be excluded from the index via the excludePropertyNames property
  • - -
  • the reindex flag which when set to true, triggers a full content re-index.
  • -
-

Example:

- -
-
{
-  NodeBuilder index = root.child("oak:index");
-  index.child("lucene")
-    .setProperty("jcr:primaryType", "oak:queryIndexDefinition", Type.NAME)
-    .setProperty("type", "lucene")
-    .setProperty("async", "async")
-    .setProperty(PropertyStates.createProperty("includePropertyTypes", ImmutableSet.of(
-        PropertyType.TYPENAME_STRING, PropertyType.TYPENAME_BINARY), Type.STRINGS))
-    .setProperty(PropertyStates.createProperty("excludePropertyNames", ImmutableSet.of( 
-        "jcr:createdBy", "jcr:lastModifiedBy"), Type.STRINGS))
-    .setProperty("reindex", true);
-}
-
-
-

The Solr Full-Text Index

-

TODO

-
-

The Node Type Index

-

The NodeTypeIndex implements a QueryIndex using PropertyIndexLookups on jcr:primaryType jcr:mixinTypes to evaluate a node type restriction on the filter. The cost for this index is the sum of the costs of the PropertyIndexLookup for queries on jcr:primaryType and jcr:mixinTypes.

-
-

Cost Calculation

-

Each query index is expected to estimate the worst-case cost to query with the given filter. The returned value is between 1 (very fast; lookup of a unique node) and the estimated number of entries to traverse, if the cursor would be fully read, and if there could in theory be one network round-trip or disk read operation per node (this method may return a lower number if the data is known to be fully in memory).

-

The returned value is supposed to be an estimate and doesn’t have to be very accurate. Please note this method is called on each index whenever a query is run, so the method should be reasonably fast (not read any data itself, or at least not read too much data).

-

If an index implementation can not query the data, it has to return Double.POSITIVE_INFINITY.

-

TODO Traversal warnings

-
-
-
- -
- - - + + + + + + + + + Jackrabbit Oak - + + + + + + + + + + + + + + + + + + Fork me on GitHub + + + + + + + + + +
+ + + + + +
+
+ +
+ + +
+ +
+

The Query Engine

+

Oak does not index content by default as does Jackrabbit 2. You need to create custom indexes when necessary, much like in traditional RDBMSs. If there is no index for a specific query, then the repository will be traversed. That is, the query will still work but probably be very slow.

+

Query Indices are defined under the oak:index node.

+
+

XPath to SQL2 Transformation

+

To support the XPath query language, such queries are internally converted to SQL2.

+

Every conversion is logged in debug level under the org.apache.jackrabbit.oak.query.QueryEngineImpl logger:

+ +
+
org.apache.jackrabbit.oak.query.QueryEngineImpl Parsing xpath statement: 
+    //element(*)[@sling:resourceType = 'slingevent:Lock')]
+org.apache.jackrabbit.oak.query.QueryEngineImpl XPath > SQL2: 
+    select [jcr:path], [jcr:score], * from [nt:base] as a 
+    where [sling:resourceType] = 'slingevent:Lock' 
+    /* xpath: //element(*)[@sling:resourceType = 'slingevent:Lock' 
+    and @lock.created < xs:dateTime('2013-09-02T15:44:05.920+02:00')] */
+
+

Each transformed SQL2 query contains the original XPath query as a comment.

+
+

Query Processing

+

Internally, the query engine uses a cost based query optimizer that asks all the available query indexes for the estimated cost to process the query. It then uses the index with the lowest cost.

+

By default, the following indexes are available:

+ +
    + +
  • A property index for each indexed property.
  • + +
  • A full-text index which is based on Apache Lucene / Solr.
  • + +
  • A node type index (which is based on an property index for the properties jcr:primaryType and jcr:mixins).
  • + +
  • A traversal index that iterates over a subtree.
  • +
+

If no index can efficiently process the filter condition, the nodes in the repository are traversed at the given subtree.

+

Usually, data is read from the index and repository while traversing over the query result. There are exceptions however, where all data is read in memory when the query is executed: when using a full-text index, and when using an “order by” clause.

+
+

The Property Index

+

Is useful whenever there is a query with a property constraint that is not full-text:

+ +
+
SELECT * FROM [nt:base] WHERE [jcr:uuid] = $id
+
+

To define a property index on a subtree you have to add an index definition node that:

+ +
    + +
  • must be of type oak:queryIndexDefinition
  • + +
  • must have the type property set to property
  • + +
  • contains the propertyNames property that indicates what properties will be stored in the index. propertyNames can be a list of properties, and it is optional.in case it is missing, the node name will be used as a property name reference value
  • +
+

Optionally you can specify

+ +
    + +
  • a uniqueness constraint on a property index by setting the unique flag to true
  • + +
  • that the property index only applies to a certain node type by setting the declaringNodeTypes property
  • + +
  • the reindex flag which when set to true, triggers a full content re-index.
  • +
+

Example:

+ +
+
{
+  NodeBuilder index = root.child("oak:index");
+  index.child("uuid")
+    .setProperty("jcr:primaryType", "oak:queryIndexDefinition", Type.NAME)
+    .setProperty("type", "property")
+    .setProperty("propertyNames", "jcr:uuid")
+    .setProperty("declaringNodeTypes", "mix:referenceable")
+    .setProperty("unique", true)
+    .setProperty("reindex", true);
+}
+
+

or to simplify you can use one of the existing IndexUtils#createIndexDefinition helper methods:

+ +
+
{
+  NodeBuilder index = IndexUtils.getOrCreateOakIndex(root);
+  IndexUtils.createIndexDefinition(index, "myProp", true, false, ImmutableList.of("myProp"), null);
+}
+
+
+

The Lucene Full-Text Index

+

The full-text index handles only the ‘contains’ type of queries.

+ +
+
//*[jcr:contains(., 'text')]
+
+

Not having a full-text index means that the full-text queries will not be able to work properly. Currently the query engine has a basic verification in place for full-text conditions, but that is brittle and can miss hits.

+

The full-text index update is asynchronous via a background thread, see Oak#withAsyncIndexing. This means that some full-text searches will not work for a small window of time: the background thread runs every 5 seconds, plus the time is takes to run the diff and to run the text-extraction process.

+

The async update status is now reflected on the oak:index node with the help of a few properties, see OAK-980

+

TODO Node aggregation OAK-828

+

The index definition node for a lucene-based full-text index:

+ +
    + +
  • must be of type oak:queryIndexDefinition
  • + +
  • must have the type property set to lucene
  • + +
  • must contain the async property set to the value async, this is what sends the index update process to a background thread
  • +
+

Optionally you can add

+ +
    + +
  • what subset of property types to be included in the index via the includePropertyTypes property
  • + +
  • a blacklist of property names: what property to be excluded from the index via the excludePropertyNames property
  • + +
  • the reindex flag which when set to true, triggers a full content re-index.
  • +
+

Example:

+ +
+
{
+  NodeBuilder index = root.child("oak:index");
+  index.child("lucene")
+    .setProperty("jcr:primaryType", "oak:queryIndexDefinition", Type.NAME)
+    .setProperty("type", "lucene")
+    .setProperty("async", "async")
+    .setProperty(PropertyStates.createProperty("includePropertyTypes", ImmutableSet.of(
+        PropertyType.TYPENAME_STRING, PropertyType.TYPENAME_BINARY), Type.STRINGS))
+    .setProperty(PropertyStates.createProperty("excludePropertyNames", ImmutableSet.of( 
+        "jcr:createdBy", "jcr:lastModifiedBy"), Type.STRINGS))
+    .setProperty("reindex", true);
+}
+
+
+

The Solr Full-Text Index

+

TODO

+
+

The Node Type Index

+

The NodeTypeIndex implements a QueryIndex using PropertyIndexLookups on jcr:primaryType jcr:mixinTypes to evaluate a node type restriction on the filter. The cost for this index is the sum of the costs of the PropertyIndexLookup for queries on jcr:primaryType and jcr:mixinTypes.

+
+

Cost Calculation

+

Each query index is expected to estimate the worst-case cost to query with the given filter. The returned value is between 1 (very fast; lookup of a unique node) and the estimated number of entries to traverse, if the cursor would be fully read, and if there could in theory be one network round-trip or disk read operation per node (this method may return a lower number if the data is known to be fully in memory).

+

The returned value is supposed to be an estimate and doesn’t have to be very accurate. Please note this method is called on each index whenever a query is run, so the method should be reasonably fast (not read any data itself, or at least not read too much data).

+

If an index implementation can not query the data, it has to return Double.POSITIVE_INFINITY.

+

TODO Traversal warnings

+
+
+
+ +
+ + + \ No newline at end of file Modified: jackrabbit/site/live/oak/docs/segmentmk.html URL: http://svn.apache.org/viewvc/jackrabbit/site/live/oak/docs/segmentmk.html?rev=1535337&r1=1535336&r2=1535337&view=diff ============================================================================== --- jackrabbit/site/live/oak/docs/segmentmk.html (original) +++ jackrabbit/site/live/oak/docs/segmentmk.html Thu Oct 24 10:42:02 2013 @@ -1,417 +1,417 @@ - - - - - - - - - Jackrabbit Oak - - - - - - - - - - - - - - - - - - - Fork me on GitHub - - - - - - - - - -
- - - - - -
-
- -
- - -
- -

SegmentMK design overview

-

Segments

-

The content tree and all its revisions are stored in a collection of immutable segments. Each segment is identified by a UUID and typically contains a continuous subset of the content tree. Some segments might also be used to store commonly occurring property values or other shared data. Segments range from a few dozens of bytes to up to 256kB in size and are stored as documents in a MongoDB collection.

-

Since segments are immutable, it’s easy for a client to keep a local in-memory cache of frequently accessed segments. Since segments also leverage locality of reference, i.e. nearby nodes are often stored in the same segment, it’s common for things like small child nodes to already exist in the cache by the time they get accessed. The intention is to avoid as many network or database roundtrips as possible by using local cache memory as efficiently as possible.

-

Content within a segment can contain references to content within other segments. Each segment keeps a list of the UUIDs of all other segments it references. This list of segment references can be used to optimize both internal storage (as seen below) and garbage collection. Segments that are no longer referenced can be efficiently identified by traversing the graph of segment-level references without having to parse or even fetch the contents of each segment.

-

The internal record structure of nodes is described in a moment once we first cover journal documents.

-

Journals

-

Journals are special, atomically updated documents that record the state of the repository as a sequence of references to successive root node records.

-

A small system could consist of just a single journal and would serialize all repository updates through atomic updates of that journal. A larger system that needs more write throughput can have more journals, linked to each other in a tree hierarchy. Commits to journals in lower levels of the tree can proceed concurrently, but will need to be periodically merged back to the root journal. Potential conflicts and resulting data loss or inconsistency caused by such merges can be avoided by always committing against the root journal.

-

Temporary branches used for large commits are also recorded as journals. A new private journal document is created for each branch and kept around until the branch gets merged or discarded. Branch journals contain an update timestamp that needs to be periodically refreshed by the client to prevent the branch from expiring and being reclaimed by the garbage collector.

-

The root node references stored in journals are used as the starting point for garbage collection. It is assumed that all content currently visible to clients must be accessible through at least one of the journals. If a client wants to keep a reference to some old content revision that’s no longer referenced by one of the main journals, it should create an empty private branch based on that revision and keep refreshing the branch until that content is no longer needed.

-

Records

-

The content inside a segment is divided in records of different types: blocks, lists, maps, values, templates and nodes. These record types and their internal structures are described in subsections below.

-

Each record is uniquely addressable by its location within the segment and the UUID of that segment. A single segment can contain up to 256kB of data and and references to up to 256 segments (including itself). Since all records are aligned at four-byte boundaries, 16 bits are needed to address all possible record locations within a segment. Thus only three bytes are needed to store a reference to any record in any segment (1 byte to identify the segment, 2 bytes for the record offset).

-
-

Block records

-

Blocks are binary records of up to 4kB. They’re used as building blocks of large binary (or string) values and stored as-is with no extra metadata or structure. Blocks are the only record type that can’t contain references to other records. Block records are typically stored in bulk segments that consist only of block records and are thus easily identifiable as containing zero references to other segments.

-
-

List records

-

List records are used as components of more complex record types. Lists are used for storing arrays of values for multi-valued properties and sequences of blocks for large binary values.

-

The list of references is split into pieces of up to 2^B references each (exact size TBD, B ~ 8) and those pieces are stored as records. If there are more than 2^B pieces like that, then a higher-level list is created of references to those pieces. This process is continued until the resulting list has less than 2^B entries. That top-level list is stored as a record prefixed with the total length of the list.

-

The result is a hierarchically stored immutable list where each element can be accessed in O(log N) time and the size overhead of updating or appending list elements (and thus creating a new immutable list) is also O(log N).

-
-

Map records

-

Like lists, maps are components of more complex record types. Maps store unordered sets of key-value pairs of record references and are used for nodes with a large number of properties or child nodes.

-

Maps are stored using the hash array mapped trie (HAMT) data structure. The hash code of each key is split into pieces of B bits each (exact size TBD, B ~ 6) and the keys are sorted into 2^B packs based on the first B bits. If a pack contains less than 2^B entries, then it is stored directly as a list of key-value pairs. Otherwise the keys are split into sub-packs based on the next B bits of their hash codes. When all packs are stored, the list of top-level pack references gets stored along with the total number of entries in the map.

-

The result is a hierarchically stored immutable map where each element can be accessed in O(log N) time and the size overhead of updating or inserting list elements is also O(log N).

-
-

Value records

-

Value records are byte arrays used for storing all names and values of the content tree. Since item names can be thought of as name values and since all JCR and Oak values can be expressed in binary form, it is easiest to simply use that form for storing all values. The size overhead of such a form for small value types like booleans or dates is amortized by the facts that those types are used only for a minority of values in typical content trees and that repeating copies of a value can be stored just once.

-

Small values, up to N kB (exact size TBD, N ~ 32), are stored inline in the record, prefixed by a byte or two to indicate the length of the value. Larger values are split into a list of fixed-size blocks and a possibly smaller tail block, and the value is stored as a list of block references.

-
-

Template records

-

A template record describes the common structure of a family of related nodes. Since the structures of most nodes in a typical content tree fall into a small set of common templates, it makes sense to store such templates separately instead of repeating that information separately for each node. For example, the property names and types as well as child node names of all nt:file nodes are typically the same. The presence of mixins and different subtypes increases the number of different templates, but they’re typically still far fewer than nodes in the repository.

-

A template record consists of a set of up to N (exact size TBD, N ~ 256) property name and type pairs. Additionally, since nodes that are empty or contain just a single child node are most common, a template record also contains information whether the node has zero, one or many child nodes. In case of a single child node, the template also contains the name of that node. For example, the template for typical mix:versionable nt:file nodes would be (using CND-like notation):

- -
-
- jcr:primaryType (NAME)
-- jcr:mixinTypes (NAME) multiple
-- jcr:created (DATE)
-- jcr:uuid (STRING)
-- jcr:versionHistory (REFERENCE)
-- jcr:predecessors (REFERENCE) multiple
-- jcr:baseVersion (REFERENCE)
-+ jcr:content
-
-

The names used in a template are stored as separate value records and included by reference. This way multiple templates that for example all contain the “jcr:primaryType” property name don’t need to repeatedly store it.

-
-

Node records

-

The overall structure of the content tree is stored in node records. Node records hold the actual content structure of the repository.

-

A typical node record consists of a template reference followed by property value references (list references for multivalued properties) and zero, one or more child node entries as indicated by the template. If the node has more than one child nodes, then those entries are stored as an array of name-node pairs of references.

-

A node that contains more than N properties or M child nodes (exact size TBD, M ~ 1k) is stored differently, using map records for the properties and child nodes. This way a node can become arbitrarily large and still remain reasonably efficient to access and modify. The main downside of this alternative storage layout is that the ordering of child nodes is lost.

-
-
-
- -
- - - + + + + + + + + + Jackrabbit Oak - + + + + + + + + + + + + + + + + + + Fork me on GitHub + + + + + + + + + +
+ + + + + +
+
+ +
+ + +
+ +

SegmentMK design overview

+

Segments

+

The content tree and all its revisions are stored in a collection of immutable segments. Each segment is identified by a UUID and typically contains a continuous subset of the content tree. Some segments might also be used to store commonly occurring property values or other shared data. Segments range from a few dozens of bytes to up to 256kB in size and are stored as documents in a MongoDB collection.

+

Since segments are immutable, it’s easy for a client to keep a local in-memory cache of frequently accessed segments. Since segments also leverage locality of reference, i.e. nearby nodes are often stored in the same segment, it’s common for things like small child nodes to already exist in the cache by the time they get accessed. The intention is to avoid as many network or database roundtrips as possible by using local cache memory as efficiently as possible.

+

Content within a segment can contain references to content within other segments. Each segment keeps a list of the UUIDs of all other segments it references. This list of segment references can be used to optimize both internal storage (as seen below) and garbage collection. Segments that are no longer referenced can be efficiently identified by traversing the graph of segment-level references without having to parse or even fetch the contents of each segment.

+

The internal record structure of nodes is described in a moment once we first cover journal documents.

+

Journals

+

Journals are special, atomically updated documents that record the state of the repository as a sequence of references to successive root node records.

+

A small system could consist of just a single journal and would serialize all repository updates through atomic updates of that journal. A larger system that needs more write throughput can have more journals, linked to each other in a tree hierarchy. Commits to journals in lower levels of the tree can proceed concurrently, but will need to be periodically merged back to the root journal. Potential conflicts and resulting data loss or inconsistency caused by such merges can be avoided by always committing against the root journal.

+

Temporary branches used for large commits are also recorded as journals. A new private journal document is created for each branch and kept around until the branch gets merged or discarded. Branch journals contain an update timestamp that needs to be periodically refreshed by the client to prevent the branch from expiring and being reclaimed by the garbage collector.

+

The root node references stored in journals are used as the starting point for garbage collection. It is assumed that all content currently visible to clients must be accessible through at least one of the journals. If a client wants to keep a reference to some old content revision that’s no longer referenced by one of the main journals, it should create an empty private branch based on that revision and keep refreshing the branch until that content is no longer needed.

+

Records

+

The content inside a segment is divided in records of different types: blocks, lists, maps, values, templates and nodes. These record types and their internal structures are described in subsections below.

+

Each record is uniquely addressable by its location within the segment and the UUID of that segment. A single segment can contain up to 256kB of data and and references to up to 256 segments (including itself). Since all records are aligned at four-byte boundaries, 16 bits are needed to address all possible record locations within a segment. Thus only three bytes are needed to store a reference to any record in any segment (1 byte to identify the segment, 2 bytes for the record offset).

+
+

Block records

+

Blocks are binary records of up to 4kB. They’re used as building blocks of large binary (or string) values and stored as-is with no extra metadata or structure. Blocks are the only record type that can’t contain references to other records. Block records are typically stored in bulk segments that consist only of block records and are thus easily identifiable as containing zero references to other segments.

+
+

List records

+

List records are used as components of more complex record types. Lists are used for storing arrays of values for multi-valued properties and sequences of blocks for large binary values.

+

The list of references is split into pieces of up to 2^B references each (exact size TBD, B ~ 8) and those pieces are stored as records. If there are more than 2^B pieces like that, then a higher-level list is created of references to those pieces. This process is continued until the resulting list has less than 2^B entries. That top-level list is stored as a record prefixed with the total length of the list.

+

The result is a hierarchically stored immutable list where each element can be accessed in O(log N) time and the size overhead of updating or appending list elements (and thus creating a new immutable list) is also O(log N).

+
+

Map records

+

Like lists, maps are components of more complex record types. Maps store unordered sets of key-value pairs of record references and are used for nodes with a large number of properties or child nodes.

+

Maps are stored using the hash array mapped trie (HAMT) data structure. The hash code of each key is split into pieces of B bits each (exact size TBD, B ~ 6) and the keys are sorted into 2^B packs based on the first B bits. If a pack contains less than 2^B entries, then it is stored directly as a list of key-value pairs. Otherwise the keys are split into sub-packs based on the next B bits of their hash codes. When all packs are stored, the list of top-level pack references gets stored along with the total number of entries in the map.

+

The result is a hierarchically stored immutable map where each element can be accessed in O(log N) time and the size overhead of updating or inserting list elements is also O(log N).

+
+

Value records

+

Value records are byte arrays used for storing all names and values of the content tree. Since item names can be thought of as name values and since all JCR and Oak values can be expressed in binary form, it is easiest to simply use that form for storing all values. The size overhead of such a form for small value types like booleans or dates is amortized by the facts that those types are used only for a minority of values in typical content trees and that repeating copies of a value can be stored just once.

+

Small values, up to N kB (exact size TBD, N ~ 32), are stored inline in the record, prefixed by a byte or two to indicate the length of the value. Larger values are split into a list of fixed-size blocks and a possibly smaller tail block, and the value is stored as a list of block references.

+
+

Template records

+

A template record describes the common structure of a family of related nodes. Since the structures of most nodes in a typical content tree fall into a small set of common templates, it makes sense to store such templates separately instead of repeating that information separately for each node. For example, the property names and types as well as child node names of all nt:file nodes are typically the same. The presence of mixins and different subtypes increases the number of different templates, but they’re typically still far fewer than nodes in the repository.

+

A template record consists of a set of up to N (exact size TBD, N ~ 256) property name and type pairs. Additionally, since nodes that are empty or contain just a single child node are most common, a template record also contains information whether the node has zero, one or many child nodes. In case of a single child node, the template also contains the name of that node. For example, the template for typical mix:versionable nt:file nodes would be (using CND-like notation):

+ +
+
- jcr:primaryType (NAME)
+- jcr:mixinTypes (NAME) multiple
+- jcr:created (DATE)
+- jcr:uuid (STRING)
+- jcr:versionHistory (REFERENCE)
+- jcr:predecessors (REFERENCE) multiple
+- jcr:baseVersion (REFERENCE)
++ jcr:content
+
+

The names used in a template are stored as separate value records and included by reference. This way multiple templates that for example all contain the “jcr:primaryType” property name don’t need to repeatedly store it.

+
+

Node records

+

The overall structure of the content tree is stored in node records. Node records hold the actual content structure of the repository.

+

A typical node record consists of a template reference followed by property value references (list references for multivalued properties) and zero, one or more child node entries as indicated by the template. If the node has more than one child nodes, then those entries are stored as an array of name-node pairs of references.

+

A node that contains more than N properties or M child nodes (exact size TBD, M ~ 1k) is stored differently, using map records for the properties and child nodes. This way a node can become arbitrarily large and still remain reasonably efficient to access and modify. The main downside of this alternative storage layout is that the ordering of child nodes is lost.

+
+
+
+ +
+ + + \ No newline at end of file