jackrabbit-dev mailing list archives

From Marcel Reutegger <mreut...@adobe.com>
Subject RE: [jr3] clustering
Date Thu, 01 Mar 2012 15:16:00 GMT
Hi,

> An open question though is how replication would fit into this picture.
> There is some mention in the paper about backup nodes for fail-over. Not
> sure if that is what we are aiming for or whether we want to go beyond
> that.

I think what they use is a backup server, which is kept in sync and can act
as a fail-over. the strict synchronization does look a bit troublesome; maybe
it could be relaxed when a commit only contains changes for a single
server.

> The paper assumes network connections to be reliable (i.e. no messages
> altered, dropped or duplicated). However there is no mention on how the
> system would recover from a partitioned network. That is, how it would
> recover when some links go down and come up later. However, since it
> uses 2 phase commit, I think it would basically inherit that behaviour
> which means cluster nodes could become blocked (See [1] proposition 7.1
> and 7.2).

writes never block in that system; they would simply fail until the pending
transaction is either committed or aborted. IIUC reads never block either.

> OTOH the combination of optimistic locking during the transaction itself
> and pessimistic locking only for the commit itself will probably result
> in very good write throughput. Even more so since probably in many cases
> there is only a single node involved in the transaction such that a
> simple commit suffices.
> 
> More comments see inline below.
> 
> [1] http://research.microsoft.com/en-us/people/philbe/ccontrol.aspx
> 
> On 1.3.12 11:05, Marcel Reutegger wrote:
> 
> [...]
> >
> > so, I was thinking of something similar as described in this
> > paper [1] or similar [2]. since a B-tree is basically an ordered
> > list of items we'd have to linearize the JCR or MK hierarchy. I'm
> > not sure whether a depth or breadth first traversal is
> > better suited. maybe there even exists a more sophisticated
> > space filling curve, which is a combination of both. linearizing
> > the hierarchy on a B-tree should give us some locality, since
> > nodes that are hierarchically close have a high probability of
> > being requested in succession.
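to make the linearization idea concrete: a depth-first traversal emits paths so that a subtree occupies a contiguous range of keys, which is what gives the B-tree its locality. this is only a sketch with hypothetical names, not the MK API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch: linearize a node hierarchy depth-first so that paths that are
// hierarchically close end up adjacent in an ordered (B-tree) key space.
// The child map and all names here are illustrative, not the MK API.
public class Linearize {

    static void depthFirst(String path, Map<String, List<String>> children,
                           List<String> out) {
        out.add(path);
        for (String child : children.getOrDefault(path, List.of())) {
            String childPath = path.equals("/") ? "/" + child : path + "/" + child;
            depthFirst(childPath, children, out);
        }
    }

    public static void main(String[] args) {
        // /a/c and /a are adjacent in the output, so a range scan starting
        // at /a touches the whole subtree before moving on to /b.
        Map<String, List<String>> children = Map.of(
                "/", List.of("a", "b"),
                "/a", List.of("c"));
        List<String> keys = new ArrayList<>();
        depthFirst("/", children, keys);
        System.out.println(keys); // prints [/, /a, /a/c, /b]
    }
}
```

a breadth-first order would instead group nodes by depth, which spreads a subtree across the key space; which of the two (or a space-filling mix) works better probably depends on the access pattern.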
> 
> Node types may give hints here. As long as they are not recursive (i.e.
> nt:hierarchy) node types usually define "things that belong together".
> 
> [...]
> 
> > Open questions:
> >
> > how does MVCC fit into this? multiple revisions of the same
> > JCR/MK node could be stored on a B-tree node. whenever
> > an update happens the garbage collection could kick in and
> > purge outdated revisions. providing a consistent journal across
> > all servers is not clear to me right now.
> 
> I think MVCC is not a problem as such. To the contrary, since it is
> append only it should even be less problematic. IMO garbage collection
> is an entirely different story and we shouldn't worry too much about it
> until we have a good working model for clustering itself.
> 
> Wrt. the journal: isn't that just the list of versions of the root node?
> This should be for free then. But I think I'm missing something here...

the model I have in mind doesn't have root node versions that
correspond to MK revisions. Is this mandated somehow by the MK
API design?

in my model only the nodes that changed get new revisions,
and reading from the tree at a given revision means it
will pick the revision which is less than or equal to the given revision.

e.g. if you have a node /a/b/c which was changed three times,
in revisions 2, 7, and 12, and a client reads at revision 9, the
implementation will return the content at revision 7.
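this "newest revision less than or equal to R" lookup maps directly onto a sorted map; a sketch of the per-node behaviour (names are illustrative, not the MK API):

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch of the read behaviour described above: each node keeps only its
// own revisions, and a read at revision R returns the newest revision <= R.
// TreeMap.floorEntry implements exactly that lookup. Illustrative only.
public class RevisionedNode {

    // revision number -> content of this node at that revision
    private final TreeMap<Long, String> revisions = new TreeMap<>();

    void write(long revision, String content) {
        revisions.put(revision, content);
    }

    // Returns the content as of the newest revision <= the given one,
    // or null if the node did not yet exist at that revision.
    String read(long revision) {
        Map.Entry<Long, String> e = revisions.floorEntry(revision);
        return e == null ? null : e.getValue();
    }

    public static void main(String[] args) {
        RevisionedNode c = new RevisionedNode(); // the node /a/b/c above
        c.write(2, "v2");
        c.write(7, "v7");
        c.write(12, "v12");
        System.out.println(c.read(9)); // prints v7: newest revision <= 9
    }
}
```

note that nothing here touches the parent node: adding a revision to /a/b/c only grows that node's own map, which is why the parent doesn't need a new revision.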

I don't see why the parent node would need to be updated
when a child node is added, removed or updated.

> > How does backup work? this is quite tricky because it is
> > difficult to get a consistent snapshot of the distributed
> > tree.
> 
> MVCC should make that easy: just make a backup of the head revision at
> that time.

hmm, I'm not sure that will scale. consider a large repository
where traversing all nodes takes a long time.

I think backup should be supported at a lower level to be
efficient.

e.g. something like what is proposed in [0], section 4.9.

regards
 marcel

[0] http://cs.ucla.edu/~kohler/class/08w-dsi/aguilera07sinfonia.pdf

