From: Thomas Mueller
To: "dev@jackrabbit.apache.org"
Date: Thu, 1 Mar 2012 14:12:17 +0000
Subject: Re: [jr3] clustering
In-Reply-To: <4F4F7EB1.6040906@apache.org>

Hi,

I didn't read all the details, but I think this is a promising approach.
Partitioning is flexible, so cluster nodes can be added and removed at
runtime without downtime, and performance should be very good.

I believe we could use the 'virtual repository' approach as the base for
this implementation. That means we first build a solution with fixed and
manual partitioning, and once we have that we can extend it towards
flexible partitioning.

It seems failover and synchronizing changes (for content that is stored
in multiple cluster nodes) would need to be solved in another way, which
also fits well into the picture.

Regards,
Thomas

On 3/1/12 2:50 PM, "Michael Dürig" wrote:

>Hi Marcel,
>
>Nice to see new ideas wrt. clustering. I quickly skimmed over the HP
>paper and think that approach sounds promising. Having a flexible
>partitioning scheme which also supports dynamic repartitioning (i.e.
>additions of servers, migration of nodes) seems a better approach to me
>than fixed partitioning by path as we discussed earlier.
>
>An open question though is how replication would fit into this picture.
>There is some mention in the paper about backup nodes for fail-over. Not
>sure if that is what we are aiming for or whether we want to go beyond
>that.
>
>The paper assumes network connections to be reliable (i.e. no messages
>altered, dropped or duplicated). However, there is no mention of how the
>system would recover from a partitioned network, that is, how it would
>recover when some links go down and come up later. Since it uses
>2 phase commit, I think it would basically inherit that behaviour,
>which means cluster nodes could become blocked (see [1], propositions 7.1
>and 7.2).
>
>OTOH the combination of optimistic locking during the transaction itself
>and pessimistic locking only for the commit will probably result in
>very good write throughput. Even more so since in many cases there is
>probably only a single node involved in the transaction, such that a
>simple commit suffices.
>
>For more comments, see inline below.
>
>[1] http://research.microsoft.com/en-us/people/philbe/ccontrol.aspx
>
>On 1.3.12 11:05, Marcel Reutegger wrote:
>
>[...]
>>
>> so, I was thinking of something similar as described in this
>> paper [1] or similar [2]. since a B-tree is basically an ordered
>> list of items we'd have to linearize the JCR or MK hierarchy. I'm
>> not sure whether a depth or breadth first traversal is
>> better suited. maybe there even exists a more sophisticated
>> space filling curve, which is a combination of both. linearizing
>> the hierarchy on a B-tree should give some locality for
>> nodes that are hierarchically close, and probability is high that
>> they are requested in succession.
>
>Node types may give hints here. As long as they are not recursive (e.g.
>nt:hierarchy) node types usually define "things that belong together".
>
>[...]
>
>> Open questions:
>>
>> how does MVCC fit into this?
>> multiple revisions of the same
>> JCR/MK node could be stored on a B-tree node. whenever
>> an update happens the garbage collection could kick in and
>> purge outdated revisions. providing a consistent journal across
>> all servers is not clear to me right now.
>
>I think MVCC is not a problem as such. On the contrary, since it is
>append only it should even be less problematic. IMO garbage collection
>is an entirely different story and we shouldn't worry too much about it
>until we have a good working model for clustering itself.
>
>Wrt. the journal: isn't that just the list of versions of the root node?
>This should be for free then. But I think I'm missing something here...
>
>>
>> How does backup work? this is quite tricky because it is
>> difficult to get a consistent snapshot of the distributed
>> tree.
>
>MVCC should make that easy: just make a backup of the head revision at
>that time.
>
>Michael
>
>>
>> Regards
>> Marcel
>>
>>
>> [0] https://docs.google.com/presentation/pub?id=131sVx5s58jAKE2FSVBfUZVQSl1W820_syyzLYRHGH6E&start=false&loop=false&delayms=3000#slide=id.g4272a65_0_39
>> [1] http://www.hpl.hp.com/techreports/2007/HPL-2007-193.pdf
>> [2] http://research.microsoft.com/en-us/people/aguilera/distributed-btree-vldb2008.pdf
>>
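
The depth-first versus breadth-first question Marcel raises can be made
concrete with a small sketch. It assumes node paths are plain strings and
that children are ordered lexicographically; depth_first_key and
breadth_first_key are hypothetical helper names, not Jackrabbit or
MicroKernel API. Sorting by path segments places each node directly before
its own subtree, while sorting by depth first keeps each level of the tree
together.

```python
def depth_first_key(path):
    """Sort key for depth-first order: the tuple of path segments.

    A parent's tuple is a prefix of its descendants' tuples, so the
    parent sorts immediately before its whole subtree.
    """
    return tuple(segment for segment in path.split("/") if segment)


def breadth_first_key(path):
    """Sort key for breadth-first order: depth first, then segments."""
    segments = depth_first_key(path)
    return (len(segments), segments)


# Hypothetical sample hierarchy, given in arbitrary order.
paths = ["/a/x", "/b", "/a", "/", "/b/y", "/a/x/1"]

depth_first = sorted(paths, key=depth_first_key)
breadth_first = sorted(paths, key=breadth_first_key)

print(depth_first)
print(breadth_first)
```

With the depth-first keys, /a, /a/x and /a/x/1 end up adjacent, so reading
a subtree touches one contiguous key range. With the breadth-first keys the
children of the root (/a and /b) end up adjacent instead, which favours
listing siblings over descending into a subtree.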