directmemory-dev mailing list archives

From Jan Kotek <dis...@kotek.net>
Subject Re: WAL Implementation
Date Sun, 24 Mar 2013 20:30:23 GMT


There are two formats in MapDB:

Append-only keeps its index in memory; it is rebuilt by replaying the log at
startup.

Journal (and direct) have a separate index because of compaction. Compaction
traverses all records and reinserts all data, which reclaims all unused disk
space.

Recid (record id) is an offset in the file and is fixed once allocated. If
index offsets were too high, I would have to keep disk space occupied to keep
the offsets valid. Leaving the index table in a separate file (without data)
keeps offsets small and makes space reclaim possible.
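
To illustrate the indirection, here is a minimal in-memory sketch. It is not
MapDB code: the class IndexedStoreSketch and all its methods are hypothetical,
the recid is simplified to a slot number in a list rather than a file offset,
and a byte array stands in for the data file. It only shows why a stable recid
plus a separate index lets compaction move data freely.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not MapDB's implementation.
final class IndexedStoreSketch {
    // Index table: recid -> {offset, length} in the data area.
    private final List<int[]> index = new ArrayList<>();
    // Stand-in for the data file; 'end' is the append position.
    private byte[] data = new byte[64];
    private int end = 0;

    // Allocate a new recid; it never changes afterwards.
    int put(byte[] value) {
        index.add(append(value));
        return index.size() - 1;
    }

    // Write the new version at the end; the old bytes become garbage.
    void update(int recid, byte[] value) {
        index.set(recid, append(value));
    }

    byte[] get(int recid) {
        int[] e = index.get(recid);
        return Arrays.copyOfRange(data, e[0], e[0] + e[1]);
    }

    // Reinsert every live record into a fresh data area and repoint the index.
    // Recids held by callers stay valid because only the index entries change.
    void compact() {
        byte[] old = data;
        data = new byte[64];
        end = 0;
        for (int recid = 0; recid < index.size(); recid++) {
            int[] e = index.get(recid);
            index.set(recid, append(Arrays.copyOfRange(old, e[0], e[0] + e[1])));
        }
    }

    // Bytes currently occupied in the data area, garbage included.
    int dataSize() {
        return end;
    }

    private int[] append(byte[] value) {
        if (end + value.length > data.length) {
            data = Arrays.copyOf(data, Math.max(data.length * 2, end + value.length));
        }
        System.arraycopy(value, 0, data, end, value.length);
        int[] entry = {end, value.length};
        end += value.length;
        return entry;
    }
}
```

After an update the old copy still occupies space; compact() drops it, and
get(recid) returns the same value before and after.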

> Every new journal file is set to the max filesize at creationtime
> and is explicitly zero-filled.
> If an entry won't fit in a standard journalfile a special
> "full-overflow" journal file (only containing that single entry) is
> created.

Note: journal == WAL in MapDB. Maybe not the best terminology choice.

MapDB does not have journal overflow. The user must explicitly call 'commit()'
to open a new log file. Only a single WAL file is supported, and it is always
replayed on commit. There is no option to have multiple not-yet-replayed logs.
I have to keep things concurrent (fine-grained locking), and multiple logs
would just skyrocket complexity.

> Every new journal file is set to the max filesize at creationtime
> and is explicitly zero-filled.

Sorry, I have no time to study or comment on your design.

> What is your exact design and what do you think is the better approach?

Usually DBs use fixed-size pages (blocks), but this layer was removed in the
last version to save space. Now the WAL is a sequence of 'modification
commands'. Each says 'write long (or byteArray) at this offset'. Each operation
(such as delete or update) is broken down into a sequence of modifications and
written into the WAL. I keep some data in memory to keep track of modified or
deleted records, but this is low overhead (typically 10 bytes per record).
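
A rough sketch of the 'modification command' idea, under loud assumptions: the
class WalSketch and its methods are hypothetical names and not MapDB classes,
the log is an in-memory list instead of a file, and the store is a plain
ByteBuffer. An update is split into two commands (one long for the index
entry, one byte array for the payload), and commit replays them in order.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not MapDB's StorageJournaled.
final class WalSketch {
    // One log entry: "write these bytes at this offset in the store".
    static final class Modification {
        final long offset;
        final byte[] data;

        Modification(long offset, byte[] data) {
            this.offset = offset;
            this.data = data;
        }
    }

    private final List<Modification> log = new ArrayList<>(); // stand-in for the WAL file
    private final ByteBuffer store;                           // stand-in for the main store

    WalSketch(int storeSize) {
        store = ByteBuffer.allocate(storeSize);
    }

    // Command: write a long at this offset.
    void logLong(long offset, long value) {
        log.add(new Modification(offset, ByteBuffer.allocate(8).putLong(value).array()));
    }

    // Command: write a byte array at this offset.
    void logBytes(long offset, byte[] value) {
        log.add(new Modification(offset, value.clone()));
    }

    // A higher-level operation broken down into low-level modifications.
    void update(long indexOffset, long dataOffset, byte[] payload) {
        logLong(indexOffset, dataOffset); // index entry now points at the new data
        logBytes(dataOffset, payload);    // the record bytes themselves
    }

    // Replay every logged modification into the store, then discard the log.
    void commit() {
        for (Modification m : log) {
            for (int i = 0; i < m.data.length; i++) {
                store.put((int) m.offset + i, m.data[i]);
            }
        }
        log.clear();
    }

    // Dropping the log discards all uncommitted changes.
    void rollback() {
        log.clear();
    }

    long readLong(long offset) {
        return store.getLong((int) offset);
    }

    int pendingCount() {
        return log.size();
    }
}
```

Nothing touches the store until commit(), so a crash or rollback before that
point leaves the store unchanged in this sketch.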

I have no time to discuss which approach is better. Just run some benchmarks
and tell me if it is faster. Also, the current code is already obsolete: it
uses a global ReadWrite lock, which will soon be removed.

> PS: Your journal implementation is MapDB specific (at least a bit
> because of the Serializer - but could be used yeah :))

It depends on other classes such as Volume (ByteBuffer abstraction). But that 
can be removed very easily. I think that code is fairly low-level and 
portable.

j.

On Sunday 24 March 2013 20:11:23 Christoph Engelbert wrote:
> Hey Jan
> 
> Thanks for your answer.
> 
> I just had a short look over the code and you're using a separate
> index file, aren't you? Is there any advantage?
> My current implementation is an append-only, fixed-size journal.
> This means I write as many entries to the file as fit in the given
> journal filesize and roll over to a new journal. If all entries in
> a full journal file are executed, the file is deleted or moved to an
> archive path.
> 
> Every new journal file is set to the max filesize at creationtime
> and is explicitly zero-filled.
> If an entry won't fit in a standard journalfile a special
> "full-overflow" journal file (only containing that single entry) is
> created.
> 
> The fileformat looks like this:
> 0x00 - 0x03    MagicHeader
> 0x04 - 0x07    Format-Version (currently 1 ;-))
> 0x08 - 0x0B    Filelength (to check if the filelength is corrupted
> by filesystem failure)
> 0x0C - 0x13    Logfile number (the number of the logfile for
> ordering multiple files while replaying)
> 0x14 - 0x14    Type of the Logfile (standard / full overflow)
> 0x15 - 0x18    Offset of the first dataset (normally 0x19 but can be
> used to inject additional properties in the header)
> 0x19 - ...         Journal records
> 
> JournalRecord (every position is calculated by record-base-offset +
> pos):
> 0x00 - 0x03    Records length (if first 4 bytes and last 4 bytes are
> equal the record isn't corrupted)
> 0x04 - 0x0B    Record ID, incrementing number
> 0x0C - 0x0C    Record type (application depending, defines type of data)
> 0x0D - 0x...     Records data
> 0x... - 0x...+4  Records length (needs to equals first four bytes of
> the record)
> 
> What is your exact design and what do you think is the better approach?
> 
> PS: Your journal implementation is MapDB specific (at least a bit
> because of the Serializer - but could be used yeah :))
> 
> Chris
> 
> Am 24.03.2013 19:41, schrieb Jan Kotek:
> > Hi,
> > 
> > There is WAL implementation (called journal) in MapDB. It has an
> > interesting feature that modified data written into log, are not stored
> > in memory, but can be re-read directly from log. MapDB is not exactly DB,
> > it is more like persistent heap.
> > 
> > Here is WAL storage implementation:
> > https://github.com/jankotek/MapDB/blob/master/src/main/java/org/mapdb/StorageJournaled.java
> > 
> > There is also 'direct' (update on place) and append-only storage
> > implementation. Please note that I am currently reimplementing this store
> > to be lock-free. In couple of days this file will be completely replaced.
> > 
> > Hope it helps.
> > Jan
> > 
> > On Sunday 24 March 2013 19:13:26 Christoph Engelbert wrote:
> >> Hey guys,
> >> 
> >> after a few weeks heavily busy at work to bring our new game to open
> >> beta I finally have some time to work on lovely opensource stuff
> >> again :-)
> >> 
> >> Currently I'm implementing a generic WAL (Write Ahead Log / Journal)
> >> implementation, in first place for the persistence system at our
> >> company.
> >> 
> >> We collect statements in a queue to be written in a background
> >> thread to linearize database load.
> >> The problem about this approach is if db servers are busy this queue
> >> can take some time to be cleaned up and if the gameservers crash
> >> before the queue is cleared (or at least the background persister is
> >> killed - for whatever reason - yeah we had a bug where data weren't
> >> written for about 4 days) player data are lost.
> >> 
> >> The new system forced all statements to be written to disk before
> >> being enqueued so that journals can be replayed on gameserver
> >> startup. I haven't found any ready to use implementation beside
> >> implementations found in frameworks like Hadoop, databases (I guess
> >> it was derby), hornetmq, etc and so I started my own implementation.
> >> I'll try to make it as generic as possible to not force it to be
> >> used for persistency (SQL Statements) only but even for maybe
> >> journaling memory access (or whatever).
> >> 
> >> Do you guys think it could be interesting for DM to implement some
> >> thing as WAL in some place? Or do you have other interesting ideas
> >> what to do with it?
> >> 
> >> I'll look forward to hopefully an intensive discussion. Maybe
> >> someone else has found a WAL implementation that could be used /
> >> analysed :-)
> >> 
> >> Chris / Noc
