jackrabbit-dev mailing list archives

From "Jukka Zitting" <jukka.zitt...@gmail.com>
Subject NGP: Value records
Date Mon, 23 Apr 2007 18:30:40 GMT
Hi,

I started prototyping the next generation persistence proposal
discussed before, and would like feedback on an idea on how to store
values in this persistence model.

My idea is to store each value in a unique and immutable "value
record" identified by a "value identifier". Duplicate values are only
stored once in a single value record. This saves space especially when
storing multiple copies of large binary documents and allows value
equality comparisons based on just the identifiers.

A value record would essentially be an array of bytes as returned by
Value.getStream(). In other words, the integer value 123 and the string
value "123" would both be stored in the same value record. More
specific typing information would be indicated in the property record
that refers to that value. For example an integer property and a
string property could both point to the same value record, but have
different property types that indicate the default interpretation of
the value.
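
As a rough illustration of the idea above (not part of the proposal
itself), a property record could carry a type tag next to a reference to
the shared bytes. The names SketchType and SketchProperty are
hypothetical, and the byte array stands in for a lookup by value
identifier:

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical property types; a real set would follow javax.jcr.PropertyType. */
enum SketchType { STRING, LONG }

/** Hypothetical property record: a type tag plus a reference to shared bytes. */
final class SketchProperty {
    final SketchType type;
    final byte[] valueRecord; // stands in for a value identifier lookup

    SketchProperty(SketchType type, byte[] valueRecord) {
        this.type = type;
        this.valueRecord = valueRecord;
    }

    /** Interprets the shared bytes according to this property's type. */
    Object read() {
        String text = new String(valueRecord, StandardCharsets.UTF_8);
        switch (type) {
            case LONG:
                return Long.valueOf(text);
            default:
                return text;
        }
    }
}
```

Two properties built over the same byte array for "123" would then read
back as the long 123 and the string "123" respectively, without the
bytes being stored twice.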

Name and path values are stored as strings using namespace prefixes
from an internal namespace registry. Stability of such values is
enforced by restricting this internal namespace registry so that it
never removes or modifies existing prefix mappings; only new namespace
mappings can be added.
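
The add-only restriction above could be sketched as follows. This is a
minimal in-memory illustration; the class name AppendOnlyNamespaces and
its methods are hypothetical, and treating a repeated identical
registration as a no-op is an assumption:

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical internal registry: mappings can be added but never changed. */
final class AppendOnlyNamespaces {
    private final Map<String, String> prefixToUri = new HashMap<>();

    /** Adds a new mapping; rejects any attempt to remap an existing prefix. */
    void register(String prefix, String uri) {
        String existing = prefixToUri.putIfAbsent(prefix, uri);
        if (existing != null && !existing.equals(uri)) {
            throw new IllegalStateException(
                "prefix '" + prefix + "' is already mapped to " + existing);
        }
    }

    /** Returns the URI for a prefix, or null if the prefix is unknown. */
    String uriOf(String prefix) {
        return prefixToUri.get(prefix);
    }

    // Deliberately no remove() or remap(): stored names and paths stay valid.
}
```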

Possible Optimizations

Extra metadata can be associated with the value records to avoid
having to parse the binary stream every time the value is accessed as
a typed value. Such metadata could include for example flags that
indicate if the byte array is valid UTF-8 and if it can be interpreted
as an integer, a float, a date, a name, a path, etc. Value records
that can be interpreted as types like integers or dates can also
contain a more efficient binary representation than the string-based
Value.getStream() byte array.
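
One possible shape for such metadata, computed once when the record is
created, is sketched below. The class ValueFlags is hypothetical, it
covers only the UTF-8 and integer cases, and using Long.parseLong as the
integer test is an assumption rather than the JCR conversion rules:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical per-record metadata computed once at creation time. */
final class ValueFlags {
    final boolean validUtf8;
    final boolean isLong;
    final long longValue; // meaningful only when isLong is true

    private ValueFlags(boolean validUtf8, boolean isLong, long longValue) {
        this.validUtf8 = validUtf8;
        this.isLong = isLong;
        this.longValue = longValue;
    }

    static ValueFlags analyze(byte[] bytes) {
        String text;
        try {
            // Strict decoding: malformed input raises instead of being replaced.
            text = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
        } catch (CharacterCodingException e) {
            return new ValueFlags(false, false, 0L);
        }
        try {
            // Cache the parsed number as the more efficient binary form.
            return new ValueFlags(true, true, Long.parseLong(text));
        } catch (NumberFormatException e) {
            return new ValueFlags(true, false, 0L);
        }
    }
}
```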

Storing values in separate value records violates locality-of-access
and can in the worst case cause separate disk loads for each value
being read. Since value records are immutable it is possible to offset
this problem by caching commonly accessed values in memory or by
putting copies of small values near places where the values are
referenced. For example a multivalued integer property could
internally be stored as an array that contains both the value
identifiers and the actual integers.
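
The multivalued integer example could look roughly like this. It is
only a sketch: InlineLongValues is a hypothetical name, and plain ints
stand in for value identifiers:

```java
/** Hypothetical multivalued integer property with inlined copies of its values. */
final class InlineLongValues {
    private final int[] valueIds; // stand-ins for value identifiers
    private final long[] inlined; // copies of the immutable values

    InlineLongValues(int[] valueIds, long[] inlined) {
        if (valueIds.length != inlined.length) {
            throw new IllegalArgumentException("one inlined value per identifier");
        }
        this.valueIds = valueIds.clone();
        this.inlined = inlined.clone();
    }

    int size() {
        return inlined.length;
    }

    /** Reads the value without loading the separate value record. */
    long valueAt(int i) {
        return inlined[i];
    }

    /** Still available for identifier-based equality checks. */
    int identifierAt(int i) {
        return valueIds[i];
    }
}
```

Because value records are immutable, the inlined copies can never go
stale relative to the records they mirror.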

Achieving uniqueness of the value records requires a way to determine
whether an instance of a given value already exists. Some indexing is
needed to avoid having to traverse the entire set of existing value
records for each new value being created. A hash table with chained
entries could be managed in append-only mode, which would integrate
easily with the proposed revision model.
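
A minimal in-memory sketch of such an index follows. The class
AppendOnlyValueIndex is hypothetical. Entries are never modified once
appended; only the table of bucket heads changes, and it is assumed here
that each revision could write that table out as a fresh copy:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical chained hash index whose entries are only ever appended. */
final class AppendOnlyValueIndex {
    private static final class Entry {
        final byte[] content;
        final int next; // position of the older entry in the same bucket, or -1

        Entry(byte[] content, int next) {
            this.content = content;
            this.next = next;
        }
    }

    private final List<Entry> log = new ArrayList<>(); // append-only entry log
    private final int[] heads;                         // newest entry per bucket

    AppendOnlyValueIndex(int buckets) {
        heads = new int[buckets];
        Arrays.fill(heads, -1);
    }

    /** Returns the log position of the matching record, appending one if absent. */
    int findOrAppend(byte[] content) {
        int bucket = Math.floorMod(Arrays.hashCode(content), heads.length);
        // Walk the chain from the newest entry towards the oldest.
        for (int i = heads[bucket]; i != -1; i = log.get(i).next) {
            if (Arrays.equals(log.get(i).content, content)) {
                return i;
            }
        }
        // The new entry points at the previous head; older entries stay untouched.
        log.add(new Entry(content.clone(), heads[bucket]));
        heads[bucket] = log.size() - 1;
        return heads[bucket];
    }

    int size() {
        return log.size();
    }
}
```

Chaining backwards from the newest entry is what keeps the log
append-only: inserting a value never rewrites an existing entry.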

Draft interfaces

Here's a quick draft of the interfaces for such an implementation:

    interface ValueIdentifier {}

    interface ValueRecord {
        InputStream getStream() throws IOException;
    }

    interface Revision {
        /** Returns the identified value from this or any previous revision. */
        ValueRecord getValue(ValueIdentifier identifier);
    }

    interface DraftRevision extends Revision {
        /**
         * Returns the value identifier of a value record with the given
         * contents. If such a record does not already exist, a new one is
         * created in this revision.
         */
        ValueIdentifier createValue(InputStream stream) throws IOException;
    }
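
To make the intended behaviour concrete, here is one way the draft
interfaces might be exercised with a throwaway in-memory implementation.
SeqIdentifier and MemoryDraftRevision are hypothetical names, the sketch
has a single revision and no persistence, and it buffers each stream
fully in memory, which a real implementation would avoid for large
binaries:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

// The draft interfaces, repeated so the sketch compiles on its own.
interface ValueIdentifier {}
interface ValueRecord { InputStream getStream() throws IOException; }
interface Revision { ValueRecord getValue(ValueIdentifier identifier); }
interface DraftRevision extends Revision {
    ValueIdentifier createValue(InputStream stream) throws IOException;
}

/** Hypothetical identifier: a sequence number with value-based equality. */
final class SeqIdentifier implements ValueIdentifier {
    final int number;
    SeqIdentifier(int number) { this.number = number; }
    @Override public boolean equals(Object o) {
        return o instanceof SeqIdentifier && ((SeqIdentifier) o).number == number;
    }
    @Override public int hashCode() { return number; }
}

/** Hypothetical single-revision, in-memory store. */
final class MemoryDraftRevision implements DraftRevision {
    private final Map<ByteBuffer, SeqIdentifier> byContent = new HashMap<>();
    private final Map<ValueIdentifier, byte[]> byIdentifier = new HashMap<>();

    @Override public ValueIdentifier createValue(InputStream stream) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        for (int n = stream.read(chunk); n != -1; n = stream.read(chunk)) {
            buffer.write(chunk, 0, n);
        }
        byte[] content = buffer.toByteArray();
        ByteBuffer key = ByteBuffer.wrap(content); // equality is content-based
        SeqIdentifier id = byContent.get(key);
        if (id == null) {
            id = new SeqIdentifier(byContent.size());
            byContent.put(key, id);
            byIdentifier.put(id, content);
        }
        return id;
    }

    @Override public ValueRecord getValue(ValueIdentifier identifier) {
        final byte[] content = byIdentifier.get(identifier);
        if (content == null) {
            return null; // unknown identifier
        }
        return () -> new ByteArrayInputStream(content);
    }
}
```

Calling createValue twice with the same bytes returns equal identifiers,
so value equality can be checked without reading either stream.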

What do you think?

BR,

Jukka Zitting
