directmemory-dev mailing list archives

From Christoph Engelbert <noctar...@apache.org>
Subject Re: WAL Implementation
Date Sun, 24 Mar 2013 21:17:50 GMT
On 24.03.2013 21:30, Jan Kotek wrote:
>
> There are two formats in MapDB; 
>
> Append-only has its index stored in memory, constructed by replay at startup.
>
> Journal (and direct) have separate index because of compaction. It traverses 
> all records and reinserts all data. This reclaims all unused disk space. 
>
> Recid (record id) is offset in file and is fixed once allocated. If index offsets 
> would be too high, I would have to keep disk space occupied to keep offset 
> valid. Leaving index table in separate file (without data) keeps offset small 
> and space reclaim possible. 
>
>> Every new journal file is set to the max filesize at creationtime
>> and is explicitly zero-filled.
>> If an entry won't fit in a standard journalfile a special
>> "full-overflow" journal file (only containing that single entry) is
>> created.
> Note:  journal==WAL in MapDB. Maybe not best terminology choice.
>
> MapDB does not have journal overflow. The user must explicitly call 'commit()' to 
> open a new log file. Only a single WAL file is supported, and it is always replayed 
> on commit. There is no option to have multiple not-yet-replayed logs. I have to 
> keep things concurrent (fine-grained locking), and multiple logs would just 
> make complexity skyrocket.

Ok I see that in theory every "transaction" is a single journal that
is replayed against the database on commit.

>> Every new journal file is set to the max filesize at creationtime
>> and is explicitly zero-filled.
> Sorry I have no time to study/comment your design

No prob ;-)

>
>> What is your exact design and what do you think is the better approach?
> Usually DBs use fixed-size pages (blocks), but this layer was removed in the last 
> version to save space. Now the WAL is a sequence of 'modification commands'. Each 
> says 'write long (or byteArray) at this offset'. Each operation (such as delete 
> or update) is broken down into a sequence of modifications and written 
> into the WAL. I keep some data in memory to keep track of modified or deleted 
> records, but this is low overhead (typically 10 bytes per record).

The good thing is that I don't need to bother with DB internals, because
my WAL implementation sits in front of the normal DB stuff, but the
general design seems to be very similar. Every modification is a
single entry in the journal, with the difference that I just use
append-only.
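
To make the 'modification commands' idea concrete, here is a minimal sketch of a log that records "write these bytes at this offset" commands and replays them in order on commit. It is not MapDB's code: the class name ModificationLog, its methods, and the use of an in-memory list and a ByteBuffer as the store are all assumptions made for illustration. A real WAL would write each command to disk before acknowledging it.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: a log of "write bytes at offset" commands. */
final class ModificationLog {

    /** One command: write {@code data} at absolute {@code offset} in the store. */
    static final class Command {
        final int offset;
        final byte[] data;

        Command(int offset, byte[] data) {
            this.offset = offset;
            this.data = data.clone(); // defensive copy, caller may reuse its array
        }
    }

    // Assumption: kept in memory here; a real WAL appends these to a file.
    private final List<Command> pending = new ArrayList<>();

    /** Records a "write long at offset" command. */
    void writeLong(int offset, long value) {
        byte[] bytes = ByteBuffer.allocate(Long.BYTES).putLong(value).array();
        pending.add(new Command(offset, bytes));
    }

    /** Records a "write byte array at offset" command. */
    void writeBytes(int offset, byte[] data) {
        pending.add(new Command(offset, data));
    }

    /** Replays all pending commands in order against the store, then clears the log. */
    void commit(ByteBuffer store) {
        for (Command c : pending) {
            ByteBuffer view = store.duplicate(); // leaves the store's own position untouched
            view.position(c.offset);
            view.put(c.data);
        }
        pending.clear();
    }

    /** Discards all pending commands without touching the store. */
    void rollback() {
        pending.clear();
    }

    int pendingCount() {
        return pending.size();
    }
}
```

A higher-level operation such as an update would call writeLong/writeBytes several times and then commit once, so the store only changes when the whole sequence is replayed.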

> I have no time to discuss which approach is better. Just run some benchmarks and 
> tell me if it is faster. Also, the current stuff is already obsolete; it uses a global 
> ReadWrite lock which will soon be removed. 

Well, I guess benchmarks are not everything. It needs to be fast, but
for me it needs to be extremely safe. If you need to write multiple
files, it is not guaranteed that both are written (but I guess a
broken index can be rebuilt by crawling the journal).
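
As a rough sketch of the safety check in the record layout quoted further down (length at the start and at the end of each record), the following encodes a record and refuses to decode it when the two length fields disagree. The class and method names are hypothetical, and it assumes the length field counts the whole record including both length fields, which the layout does not state explicitly.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch of the length-framed journal record. */
final class JournalRecordCodec {

    // leading length (4) + record id (8) + record type (1) + trailing length (4)
    static final int OVERHEAD = 4 + 8 + 1 + 4;

    /** Encodes one record. Assumption: the length covers the whole record. */
    static byte[] encode(long recordId, byte type, byte[] data) {
        int length = OVERHEAD + data.length;
        ByteBuffer buf = ByteBuffer.allocate(length);
        buf.putInt(length);    // 0x00 - 0x03
        buf.putLong(recordId); // 0x04 - 0x0B
        buf.put(type);         // 0x0C
        buf.put(data);         // 0x0D - ...
        buf.putInt(length);    // trailing copy of the length
        return buf.array();
    }

    /**
     * Returns the record data starting at {@code base}, or null if the record
     * is missing, truncated, or its two length fields do not match.
     */
    static byte[] decodeData(ByteBuffer journal, int base) {
        int available = journal.limit() - base;
        if (available < OVERHEAD) {
            return null;
        }
        int length = journal.getInt(base);
        // A zero-filled (unused) region yields length 0 and is rejected here.
        if (length < OVERHEAD || length > available) {
            return null;
        }
        int trailing = journal.getInt(base + length - 4);
        if (trailing != length) {
            return null; // torn or partially written record
        }
        byte[] data = new byte[length - OVERHEAD];
        ByteBuffer view = journal.duplicate();
        view.position(base + 13); // skip length, record id and type
        view.get(data);
        return data;
    }
}
```

Matching lengths only show that the record was written to its end; they do not detect corruption inside the data itself, for which a checksum would be needed.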

>> PS: Your journal implementation is MapDB specific (at least a bit
>> because of the Serializer - but could be used yeah :))
> It depends on other classes such as Volume (ByteBuffer abstraction). But that 
> can be removed very easily. I think that code is fairly low-level and 
> portable.

Thanks for your comments, I'll take a deeper look in the new version
when it's done.

Chris

> j.
>
> On Sunday 24 March 2013 20:11:23 Christoph Engelbert wrote:
>> Hey Jan
>>
>> Thanks for your answer.
>>
>> I just had a short look over the code and you're using a separate
>> index file, aren't you? Is there any advantage?
>> My current implementation is an append-only, fixed-size journal.
>> This means I write as many entries to the file as fit in the given
>> journal file size and then roll over to a new journal. Once all entries
>> in a full journal file are executed, the file is deleted or moved to an
>> archive path.
>>
>> Every new journal file is set to the max filesize at creationtime
>> and is explicitly zero-filled.
>> If an entry won't fit in a standard journalfile a special
>> "full-overflow" journal file (only containing that single entry) is
>> created.
>>
>> The file format looks like this:
>> 0x00 - 0x03    Magic header
>> 0x04 - 0x07    Format version (currently 1 ;-))
>> 0x08 - 0x0B    File length (to check if the file length is corrupted
>>                by filesystem failure)
>> 0x0C - 0x13    Logfile number (the number of the logfile for
>>                ordering multiple files while replaying)
>> 0x14 - 0x14    Type of the logfile (standard / full overflow)
>> 0x15 - 0x18    Offset of the first dataset (normally 0x19 but can be
>>                used to inject additional properties in the header)
>> 0x19 - ...     Journal records
>>
>> JournalRecord (every position is calculated by record-base-offset +
>> pos):
>> 0x00 - 0x03      Record length (if first 4 bytes and last 4 bytes are
>>                  equal the record isn't corrupted)
>> 0x04 - 0x0B      Record ID, incrementing number
>> 0x0C - 0x0C      Record type (application dependent, defines type of data)
>> 0x0D - 0x...     Record data
>> 0x... - 0x...+4  Record length (needs to equal the first four bytes of
>>                  the record)
>>
>> What is your exact design and what do you think is the better approach?
>>
>> PS: Your journal implementation is MapDB specific (at least a bit
>> because of the Serializer - but could be used yeah :))
>>
>> Chris
>>
>> On 24.03.2013 19:41, Jan Kotek wrote:
>>> Hi,
>>>
>>> There is a WAL implementation (called journal) in MapDB. It has an
>>> interesting feature: modified data written into the log is not stored
>>> in memory, but can be re-read directly from the log. MapDB is not exactly
>>> a DB, it is more like a persistent heap.
>>>
>>> Here is WAL storage implementation:
>>> https://github.com/jankotek/MapDB/blob/master/src/main/java/org/mapdb/StorageJournaled.java
>>>
>>> There is also a 'direct' (update in place) and an append-only storage
>>> implementation. Please note that I am currently reimplementing this store
>>> to be lock-free. In a couple of days this file will be completely replaced.
>>>
>>> Hope it helps.
>>> Jan
>>>
>>> On Sunday 24 March 2013 19:13:26 Christoph Engelbert wrote:
>>>> Hey guys,
>>>>
>>>> after a few weeks heavily busy at work to bring our new game to open
>>>> beta I finally have some time to work on lovely opensource stuff
>>>> again :-)
>>>>
>>>> Currently I'm implementing a generic WAL (Write Ahead Log / Journal),
>>>> in the first place for the persistence system at our
>>>> company.
>>>>
>>>> We collect statements in a queue to be written in a background
>>>> thread to linearize database load.
>>>> The problem about this approach is if db servers are busy this queue
>>>> can take some time to be cleaned up and if the gameservers crash
>>>> before the queue is cleared (or at least the background persister is
>>>> killed - for whatever reason - yeah we had a bug where data weren't
>>>> written for about 4 days) player data are lost.
>>>>
>>>> The new system forces all statements to be written to disk before
>>>> being enqueued so that journals can be replayed on gameserver
>>>> startup. I haven't found any ready-to-use implementation besides
>>>> those found in frameworks like Hadoop, databases (I guess
>>>> it was Derby), HornetQ, etc., and so I started my own implementation.
>>>> I'll try to make it as generic as possible so that it is not limited
>>>> to persistence (SQL statements) only but can maybe also be used for
>>>> journaling memory access (or whatever).
>>>>
>>>> Do you guys think it could be interesting for DM to implement
>>>> something like a WAL in some place? Or do you have other interesting
>>>> ideas what to do with it?
>>>>
>>>> I'll look forward to hopefully an intensive discussion. Maybe
>>>> someone else has found a WAL implementation that could be used /
>>>> analysed :-)
>>>>
>>>> Chris / Noc

