From oak-dev-return-2771-apmail-jackrabbit-oak-dev-archive=jackrabbit.apache.org@jackrabbit.apache.org Wed Oct 17 23:06:55 2012 Return-Path: X-Original-To: apmail-jackrabbit-oak-dev-archive@minotaur.apache.org Delivered-To: apmail-jackrabbit-oak-dev-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 16F2DD736 for ; Wed, 17 Oct 2012 23:06:55 +0000 (UTC) Received: (qmail 59750 invoked by uid 500); 17 Oct 2012 23:06:55 -0000 Delivered-To: apmail-jackrabbit-oak-dev-archive@jackrabbit.apache.org Received: (qmail 59695 invoked by uid 500); 17 Oct 2012 23:06:55 -0000 Mailing-List: contact oak-dev-help@jackrabbit.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: oak-dev@jackrabbit.apache.org Delivered-To: mailing list oak-dev@jackrabbit.apache.org Received: (qmail 59686 invoked by uid 99); 17 Oct 2012 23:06:54 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 17 Oct 2012 23:06:54 +0000 X-ASF-Spam-Status: No, hits=-0.7 required=5.0 tests=RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of ianboston@gmail.com designates 209.85.160.42 as permitted sender) Received: from [209.85.160.42] (HELO mail-pb0-f42.google.com) (209.85.160.42) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 17 Oct 2012 23:06:47 +0000 Received: by mail-pb0-f42.google.com with SMTP id ro2so11907641pbb.1 for ; Wed, 17 Oct 2012 16:06:26 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:content-type; bh=1s7q0NL+Spi+Q3DyWsUth+LPuM4zV3pn99JQgAKbBz0=; b=mGbdlQgvjrH441tFvhf61oFDshGVMx7/2gf0i5g9Ug2nfp3Z3UlDtu+O+P3AWmTy4K zBj+vKVqCeKcYH/2zls32vj5uMeSIkOZsHo/pS7s+gXAN5mD+2jCbnZvCCqLCrJnI5Rj WKLSf2V+LEMnwE/+rUBytJgPc1AUhlbEXYLBlJbSwklKptu06wXTTxARaVa6JBX072q0 0k7yXZcIPrMmwp5LPdjsJ4HRzRAz2gAqlonv1TliYxUWuJsSUcDlQA54fqbqIyL04wZp LpSGRHsoRui/QPHh7AENR2V7AB2BevtYnuEQGPG7fQ0Zd/OreiV9KGkk45/m+t9xtFZ+ YIYQ== MIME-Version: 1.0 Received: by 10.68.132.41 with SMTP id or9mr61422436pbb.67.1350515186416; Wed, 17 Oct 2012 16:06:26 -0700 (PDT) Sender: ianboston@gmail.com Received: by 10.66.231.8 with HTTP; Wed, 17 Oct 2012 16:06:26 -0700 (PDT) In-Reply-To: References: Date: Thu, 18 Oct 2012 10:06:26 +1100 X-Google-Sender-Auth: wTRGtvoQN1zmTENUmHEK-AkiAWg Message-ID: Subject: Re: Question: wide hierachies From: Ian Boston To: oak-dev@jackrabbit.apache.org Content-Type: text/plain; charset=ISO-8859-1 X-Virus-Checked: Checked by ClamAV on apache.org Hi Stefan, comments inline: On 18 October 2012 02:46, Stefan Guggisberg wrote: > hi ian, > > On Sun, Oct 14, 2012 at 2:49 AM, Ian Boston wrote: >> Hi, >> IIUC Oak has the potential to handle wide hierarchies where there are >> millions of child nodes to a parent and hence support some of the use >> cases commonly found in Social Media and cloud applications. This >> question is about the implementation to support that. >> >> IIUC from reading the code child nodes are accessed from the parent >> node, and to avoid the situation in JR2 where child nodes were >> represented as an array of objects stored with the parent node, child >> nodes in Oak are represented in a internal tree structure where the >> width of the tree is controlled to ensure concurrency and performance >> is maintained. (did I understand that correctly ?). > > in JR2 the list of child node entries (name-id pairs) is stored within the > parent node. while this is very efficient for small to medium sized child > node lists (up to a couple of k entries) read and write performance suffers > when the list grows very large. > > in the current MicroKernel implementation we use an htree based intermediate > tree structure for storing large child node collections. phew, thats what I understood. > >> >> In choosing this approach, was a possible alternative considered where >> the child node contains a single pointer to its parent, and the child >> node is retrieved by hashing it's path rather ie hash-f(path) === >> child-node-storage-id, (eg sha1(path) == child-node-storage-id ) >> rather than navigating from the parent. > > the data model of the current MicroKernel implementation is essentially > just a DAG of node objects. the versioning model is git-like, i.e. > every change results in a new immutable tree. > > i don't see how child->parent relationships could be implemented with > this model. IIUC with path-based hashing you'd sacrifice MVCC > or am i missing something? Hmm, good point, your right, you cant MVCC the tree structure, only the data within the tree, since the path -> pointer transformation is immutable, thats a pity. Thanks for explaining. Ian > >> >> In my exposure to use cases for social media and cloud like systems I >> have observed several things. Listing child nodes from a parent >> becomes rarer the closer the application moves towards social media. >> Direct pointer based access derived from a hierarchical url is the >> main use case. For non hierarchal pointer, searching becomes dominant. >> Occasionally listing is required but only as a variant of search. >> Scans or sorted scans as nearly always avoided as being impractical. >> >> In social media content, the cardinality of all parents is low making >> finding children of parents amenable to an inverted index approach. Ie >> the parent becomes a search keyword term. >> >> (the closer to Enterprise Content Management the application is the >> inverse of the above is true) >> >> If I understood all of that correctly, my question: >> >> In circumstances where the use cases are more to the social media end >> of the spectrum, will it be possible to invert the child-parent >> relationship within Oak to be a pointer from the child with an >> optional, additional inverted index should finding children of a >> specific parent be required ? > > WRT the MicroKernel implementation: > > IMO that's unfortunately not possible with the current git-like > approach (content-based identifiers, DAG of immutable versions). > > cheers > stefan > >> >> Apologies if this is already covered and I missed it. The archives at >> Markmail didn't turn anything up. >> >> Thanks >> Ian