incubator-jena-dev mailing list archives

From: Andy Seaborne <>
Subject: Different policy for concurrency access in TDB supporting a single writer and multiple readers
Date: Tue, 22 Mar 2011 15:06:05 GMT
Switching to email for general discussion:


Could you say something about your design ideas for transactions and 
concurrency in Parliament?  And specifically, what is the unit of MVCC?

I've mainly been thinking about the design for TDB: a design that 
uses blocks as the unit of transaction and MVCC, with each writer seeing 
a view of the block space that accords with its initial time point.

Using a WAL as a redo log, the changes are not in the main database 
until the WAL is checkpointed.

If you have the time sequence:

R1: start reader 1
W1: start writer 1
W1: finish writer 1
# a reader here is like R1
R2: start reader 2
W2: start writer 2
W2: finish writer 2
R3: start reader 3
... finish queries ...

which is single-writer, but reader 1's lifetime spans two writers.
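To make the sequence concrete, here is a minimal sketch (hypothetical names, not TDB code) of version pinning: each committed writer bumps a global version, and a reader is fixed at the version current when it starts, never observing later commits.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: each writer commit bumps a global version; a reader pins
// the version current when it starts and never sees later commits.
public class VersionTimeline {
    private long committedVersion = 0;               // bumped on each writer commit
    private final List<Long> readerVersions = new ArrayList<>();

    // Start a reader: it is pinned to the current committed version.
    public long beginReader() {
        readerVersions.add(committedVersion);
        return committedVersion;
    }

    // Single writer at a time: finishing a writer publishes a new version.
    public void commitWriter() {
        committedVersion++;
    }
}
```

Replaying the sequence above: R1 pins version 0 (the base database), R2 pins version 1 (after W1), R3 pins version 2 (after W2).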

My idea has been to create an "IP layer for data" (IP as in Internet 
Protocol) over blocks, rather than making the data structures above that 
layer (B+Trees, extensible hash tables) versioned.  The data-structure 
code has to be slightly version-aware, to pass the version context up and 
down to the block layer, but the algorithms are essentially not version-
specific.  This depends on making the block layer sufficiently efficient, 
which might become the practical limit on transaction size. 
Compression, or recording operations on blocks according to type (an 
extreme form of application-driven compression), would be very good.  It 
retains the data-publishing focus of TDB - that is, read 
performance is primary and costs fall mainly on the writers.
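A minimal sketch of that block-layer idea, under the assumption that the layer keeps each block's committed copies keyed by version (the names are mine, not TDB's): the data-structure code asks for a block within a version context and gets back the newest copy at or below that version.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Sketch: block id -> (version -> bytes).  A transaction at version v sees
// the newest copy with version <= v.  The B+Tree code above this layer
// only threads the version through; it is otherwise version-unaware.
public class VersionedBlockLayer {
    private final Map<Long, TreeMap<Long, byte[]>> blocks = new HashMap<>();

    // Record a block copy written by the writer that committed at 'version'.
    public void put(long blockId, long version, byte[] data) {
        blocks.computeIfAbsent(blockId, k -> new TreeMap<>()).put(version, data);
    }

    // Resolve the block as seen from the given version context.
    public byte[] get(long blockId, long version) {
        TreeMap<Long, byte[]> versions = blocks.get(blockId);
        if (versions == null) return null;
        Map.Entry<Long, byte[]> e = versions.floorEntry(version);
        return e == null ? null : e.getValue();
    }
}
```

The point of keeping versioning here, below the indexes, is that adding a new index structure later needs no MVCC logic of its own.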

Only experimentation will tell.  I hope to have other data 
structures for indexing in the future, so a clean separation of the 
transaction/MVCC issues from the data-structure algorithms is going to 
be good.

Given that design, SERIALIZABLE is the natural isolation policy.  There 
is very little locking over blocks in the main database (see below for 
the possibility of none!).

That's not to say the API design should not include the other isolation-
level constants, even if they are not used.

Serializable isolation happens naturally: each writer sees its modified 
blocks (as written to the WAL), plus the blocks of past committed writers, 
plus the base database.  Readers just don't get changed blocks; reader 1 
is working against the main database; reader 2 gets modified blocks from 
the WAL, because the WAL can't all be written to the main database until 
reader 1 gets out of the way.

READ_COMMITTED makes sense for limiting growth of the WAL, as it means 
the WAL can be flushed to the main database before an earlier 
reader has finished.  It would have to be done in such a way that all 
the changes happen at once, or the data-structure invariants could be 
invalid; that is, the reader would also have to be outside the index 
concerned, and that needs some locking that would not otherwise be there.

REPEATABLE_READ would need block locking for the updates, but seems to 
have a quirk.  Unfortunately, just knowing which blocks the reader has 
seen may not be strong enough: a change to an index's blocks may touch 
both a block the reader has looked at and one it has not, in a way that 
matters for the data structure's consistency, and block-level locking 
will miss that.  I'm not sure that's safe for the B+Trees - it's not in 
general.

READ_UNCOMMITTED is very likely to break the index structures as the 
higher level assumptions in the B+trees will not be valid.

A feature of the TxTDB MVCC design is that no locking is done on 
the main database.  Changes only happen when writing back committed 
blocks, and by then you know the original block being overwritten is never 
seen by a reader.  Write-back occurs only when all readers up to that 
writer version have exited, so any outstanding reader will be getting 
the updated block via the WAL copy.  No block lock needed (a bit scary, 
that bit :-).
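That write-back gate can be sketched as a pure predicate (hypothetical names again): a record committed at version v may be flushed to the main database only when no live reader is pinned to an earlier version, because every remaining reader is at version >= v and reads the WAL copy anyway.

```java
import java.util.Collection;

// Sketch: a WAL record committed at 'commitVersion' can be written back to
// the main database only once every reader pinned to an earlier version
// has exited; the overwrite is then unobservable, so no block lock is needed.
public class WriteBackGate {
    public static boolean canWriteBack(long commitVersion,
                                       Collection<Long> activeReaderVersions) {
        for (long rv : activeReaderVersions)
            if (rv < commitVersion) return false;   // an older reader is still live
        return true;
    }
}
```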

Recovery is a matter of redoing the WAL, so it is potentially more 
expensive than an undo log, where the undo actions cover just the 
running transactions rather than a period of recent history of writers.  
But we don't crash, do we? :-)
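A redo pass can be sketched like this (the Record shape is a placeholder, not the TDB WAL format): committed records are replayed in log order into the main block map; records from uncommitted transactions are simply never applied.

```java
import java.util.List;
import java.util.Map;

// Sketch: redo-log recovery replays committed WAL records in log order.
// Record is a hypothetical (blockId, bytes, committed) triple, not the
// actual on-disk WAL format.
public class WalRecovery {
    public record Record(long blockId, byte[] data, boolean committed) {}

    // Apply every committed record, in order, to the main block map.
    public static void redo(List<Record> wal, Map<Long, byte[]> mainDb) {
        for (Record r : wal) {
            if (r.committed()) mainDb.put(r.blockId(), r.data());
            // uncommitted records are discarded
        }
    }
}
```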

I think that limiting the number of versions back that will be admitted 
to some control constant (like 10 or 20, ideally adapting to fewer for 
larger transactions) will be appropriate.  If it's growing, that's a 
sign the system is being asked to do more than it has the processing 
power to do.
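That admission policy might be sketched as a simple window check (the constant and names are illustrative, per the "10 or 20" suggestion above): the system is healthy while the oldest live reader stays within the window of the current committed version.

```java
// Sketch: admit new work only while the oldest live reader is no more than
// 'maxVersionsBack' commits behind the current committed version; if the
// gap exceeds the window, the system is falling behind and should throttle.
public class VersionWindow {
    public static boolean withinWindow(long committedVersion,
                                       long oldestReaderVersion,
                                       int maxVersionsBack) {
        return committedVersion - oldestReaderVersion <= maxVersionsBack;
    }
}
```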

Most of the time the WAL is write-only, and there are in-memory copies 
of blocks (i.e. a cache, with version tiers).  This is analogous to the 
current direct-mode write cache, but with an additional 
write-back policy, as it needs to sync when a transaction commits.


On 21/03/11 22:09, Stephen Allen (JIRA) wrote:
> Stephen Allen commented on JENA-41:
> -----------------------------------
> I think your idea about the DatasetGraph being the interface for
> transactions makes sense.  Transactional DatasetGraphs could also
> provide fallback behavior for legacy code by implementing autocommit
> transactions if the user called methods on a dataset that was not
> initialized in a transactionBegin() call.
> With regard to the isolation levels, I believe some of the lower
> levels can make sense for particular applications or queries.  For
> example, say you want to know the sizes of a few graphs:
> select count (*) where { graph <http://example/g1> { ?s ?p ?o . } } ;
> select count (*) where { graph <http://example/g2> { ?s ?p ?o . } } ;
> Assuming a traditional pessimistic locking scheme, running the
> transaction at SERIALIZABLE could cause the locks held by the first
> select query to also be held through the second query, reducing
> concurrency (using two transactions instead might not be a good idea
> as there is usually some amount of overhead associated with creating
> and committing transactions).
> If you were OK with the possibility that the two query results are
> not truly serializable with respect to each other, then you could
> improve concurrency by using a READ_COMMITTED isolation level instead
> that would give serializable results for each query (but not the
> whole transaction).  And if you really just needed a rough estimate
> of size, using READ_UNCOMMITTED may be able to avoid locking
> altogether.
> An additional motivating factor for MVCC implementations is that they
> may be implementing snapshot isolation, which probably maps better to
> READ_COMMITTED than SERIALIZABLE (especially if it could do predicate
> locking for true serializable behavior but allow cheaper snapshot
> isolation if that was all that was needed).  The Postgres
> documentation does a good job of describing this [1].
> I would find it useful to have multiple isolation levels available
> (even if internally I'm mapping them all to SERIALIZABLE at first).
> The four ANSI Isolation levels seem appropriate, and remember that
> implementations are allowed to map unavailable lower levels to higher
> levels as desired.
> [1]
