jackrabbit-oak-dev mailing list archives

From Thomas Mueller <muel...@adobe.com>
Subject Re: Design idea for a production-scale in-memory microkernel
Date Mon, 13 Aug 2012 14:59:59 GMT

I know this isn't directly related to the "in-memory microkernel" idea,
but it seems to me that the reason to propose an "in-memory microkernel"
is to improve performance. Unless I misunderstood the mail?

As for Jackrabbit 2.x read and write performance, I found that JCR-2857
helps, especially for larger repositories - depending on the use case, by
more than an order of magnitude. This is without any additional heap
memory. I also found that the more sessions are open, the slower the
writes get, due to internal event processing. So keeping fewer sessions
open may help improve write performance.
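
For illustration, a minimal sketch of that last point - batching writes
through one long-lived session instead of opening a session per write.
The repository reference, credentials and node names are placeholders:

    import javax.jcr.Node;
    import javax.jcr.Repository;
    import javax.jcr.Session;
    import javax.jcr.SimpleCredentials;

    public class BatchWriteExample {

        // Hypothetical helper: performs many writes through a single session
        // instead of opening one session per write, so the per-session event
        // processing overhead is paid only once.
        public static void writeBatch(Repository repository, int count)
                throws Exception {
            Session session = repository.login(
                    new SimpleCredentials("admin", "admin".toCharArray()));
            try {
                Node parent = session.getRootNode().addNode("batch");
                for (int i = 0; i < count; i++) {
                    parent.addNode("node-" + i);
                }
                session.save();   // one save for the whole batch
            } finally {
                session.logout(); // release the session when done
            }
        }
    }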


On 8/9/12 10:29 AM, "Felix Meschberger" <fmeschbe@adobe.com> wrote:

>Interesting thoughts.
>
>To expand on the assumption of keeping the complete tree in memory, I'd
>like to add that the most dramatic improvement in overall Jackrabbit
>performance for a given configuration can probably be reached by
>increasing the bundle cache size, which eventually amounts to more or
>less what you are proposing.
>
>On 07.08.2012 at 16:07, Jukka Zitting wrote:
>> Hi,
>>
>> [Just throwing an idea around, no active plans for further work on this.]
>>
>> One of the biggest performance bottlenecks with current repository
>> implementations is disk speed, especially seek times but also raw data
>> transfer rate in many cases. To work around those limitations, we have
>> used various caching strategies in Jackrabbit that considerably
>> complicate the codebase and still struggle with cache misses and
>> write-through performance.
>>
>> As an alternative to such designs, I was thinking of a microkernel
>> implementation that would keep the *entire* tree structure in memory,
>> i.e. only use the disk or another backend for binaries and possibly
>> for periodic backup dumps. Fault tolerance against hardware failures
>> or other restarts would be achieved by requiring a clustered
>> deployment where all content is kept as copies on at least three
>> separate physical servers. Redis (http://redis.io/) is a good example
>> of the potential performance gains of such a design.
>>
>> To estimate how much memory such a model would need, I looked at the
>> average bundle size of a vanilla CQ5 installation. There, the average
>> bundle size (i.e. a node with all its properties and child node
>> references) is just 251 bytes. Even assuming larger bundles and some
>> level of storage and index overhead, it seems safe to assume up to
>> about 1kB of memory per node on average. That would allow one to store
>> roughly 1M nodes per 1GB of memory.
>>
>> Assuming that all content is evenly spread across the cluster in a way
>> that puts copies of each individual bundle on at least three different
>> cluster nodes and that each cluster node additionally keeps a large
>> cache of most frequently accessed content, a large repository with
>> 100+M content nodes could easily run on a twelve-node cluster where
>> each cluster node has 32GB RAM, a reasonable size for a modern server
>> (also available from EC2 as m2.2xlarge). A mid-size repository with
>> 10+M content nodes could run on a three- or four-node cluster with
>> just 16GB RAM per cluster node (or m2.xlarge in EC2).
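
As a quick back-of-envelope check of the estimate quoted above (the ~1kB
per node, the three copies per bundle, the 100+M nodes and the 32GB
servers are taken from the mail; the class and variable names are just
made up for illustration):

    public class SizingEstimate {

        // Rough capacity estimate: total memory needed for the raw copies,
        // and how that maps onto servers of a given size.
        public static void main(String[] args) {
            long nodes = 100_000_000L;   // 100M content nodes
            long bytesPerNode = 1024L;   // ~1kB per node (upper estimate)
            long copies = 3;             // each bundle on 3 cluster nodes
            long serverRamGb = 32;       // e.g. an EC2 m2.2xlarge

            long totalGb = nodes * bytesPerNode * copies / (1024L * 1024 * 1024);
            long servers = (totalGb + serverRamGb - 1) / serverRamGb;

            // Prints "286 GB total, ~9 servers": the raw copies alone fit on
            // nine such machines, so a twelve-node cluster leaves headroom
            // for the per-node caches of frequently accessed content.
            System.out.println(totalGb + " GB total, ~" + servers + " servers");
        }
    }
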
>> I believe such a microkernel could set a pretty high bar for
>> performance! The only major performance limit I foresee is the network
>> overhead when writing (need to send updates to other cluster nodes)
>> and during cache misses (need to retrieve data from other nodes), but
>> the cache misses would only start affecting repositories that go
>> beyond what fits in memory on a single server (i.e. the mid-size
>> repository described above wouldn't yet be hit by that limit) and the
>> write overhead could be amortized by allowing the nodes to temporarily
>> diverge until they have a chance to sync up again in the background
>> (as allowed by the MK contract).
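
A rough, purely illustrative sketch of that write path - apply the change
to the local in-memory tree right away and let a background thread ship
it to the other cluster nodes. The store and the Peer interface are
hypothetical, not taken from any actual microkernel code:

    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.LinkedBlockingQueue;

    // Hypothetical in-memory node store: the full tree lives in a map keyed
    // by path, writes become visible locally at once and are replicated to
    // the peers asynchronously, so cluster nodes may temporarily diverge.
    public class InMemoryNodeStore {

        public interface Peer {
            void apply(String path, byte[] bundle); // ship one update to a peer
        }

        private final Map<String, byte[]> bundles = new ConcurrentHashMap<>();
        private final BlockingQueue<String> pending = new LinkedBlockingQueue<>();
        private final List<Peer> peers;

        public InMemoryNodeStore(List<Peer> peers) {
            this.peers = peers;
            Thread replicator = new Thread(this::replicate, "replicator");
            replicator.setDaemon(true);
            replicator.start();
        }

        // Local write: applied immediately, queued for background replication.
        public void write(String path, byte[] bundle) {
            bundles.put(path, bundle);
            pending.add(path);
        }

        public byte[] read(String path) {
            return bundles.get(path); // a real store would fall back to a peer
        }

        // Background loop that amortizes the network overhead of writes by
        // letting the peers catch up asynchronously.
        private void replicate() {
            try {
                while (true) {
                    String path = pending.take();
                    byte[] bundle = bundles.get(path);
                    for (Peer peer : peers) {
                        peer.apply(path, bundle);
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
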
>> BR,
>> Jukka Zitting
