directory-dev mailing list archives

From Kiran Ayyagari <>
Subject Re: Mavibot bulkloader update
Date Wed, 24 Sep 2014 12:07:19 GMT
On Wed, Sep 24, 2014 at 2:45 PM, Emmanuel L├ęcharny <>

> Hi guys,
> a quick update on the bulkloader development... Using what Kiran did in
> the mavibot-partition-bulkloader, I moved the part that specifically
> handles the BTree bulkloading to Mavibot, because there is no reason
> someone could not bulkload any kind of data into a BTree.
> I created a BulkLoader class in Mavibot for that purpose. The bulk
> loading is a two-phase process:
> - sort the data
> - create the btree
> We had something working for in-memory data (ie, when you can read
> everything in memory), but when it comes to reading huge sets of data, you
> have to write the sorted data to disk. This is what I have been working on
> lately. Basically, we read a chunk of data, and when we reach a given
> number of elements (configurable), we sort what we read and write it back to
> disk. I tested it with 10M elements, written to 100 files. The deal is to
> merge the sorted data when we read it back from the files (as each file is
> sorted, it's just a matter of taking the smallest value across all the files).
> All in all, this part is now working.
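The merge phase described above is a classic k-way merge of sorted runs. A minimal sketch (the class and method names here are made up for illustration, not Mavibot's actual code) keeps the head of each run in a min-heap and repeatedly pulls the smallest one:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

public class SortedRunMerge {
    // One entry per run: the run's current head value plus the rest of the run.
    private static final class Head {
        final int value;
        final Iterator<Integer> rest;
        Head(int value, Iterator<Integer> rest) { this.value = value; this.rest = rest; }
    }

    /** Merge already-sorted runs by always picking the smallest current head. */
    public static List<Integer> merge(List<List<Integer>> runs) {
        PriorityQueue<Head> heap = new PriorityQueue<>(Comparator.comparingInt((Head h) -> h.value));
        for (List<Integer> run : runs) {
            Iterator<Integer> it = run.iterator();
            if (it.hasNext()) {
                heap.add(new Head(it.next(), it));
            }
        }
        List<Integer> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Head h = heap.poll();
            merged.add(h.value);
            if (h.rest.hasNext()) {
                // Refill the heap from the run we just consumed from.
                heap.add(new Head(h.rest.next(), h.rest));
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<Integer>> runs = Arrays.asList(
            Arrays.asList(1, 4, 9),
            Arrays.asList(2, 3, 10),
            Arrays.asList(5, 6, 7, 8));
        System.out.println(merge(runs)); // prints [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    }
}
```

With 100 on-disk files, each `Iterator` would wrap a file reader instead of an in-memory list, so only one element per file is held in memory at a time.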
> The second phase was an issue in the current bulkload implementation:
> it left the btree unbalanced (when you fill the leaves completely, the
> last leaf may contain fewer than N/2 elements; the current
> implementation didn't handle this case). It was quite a challenge to
> balance a btree while building it, as the problem spreads to all the
> upper layers. After a few hours thinking about the best way to do it
> without having to shuffle many already-written pages, I realized that we
> actually *know* how many entries we have in the BTree, as we have sorted
> the entries first! Knowing the number of elements is the key: you
> can predict how the btree should be balanced.
> This is the part I'm working on atm.
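To illustrate why knowing the total count helps: instead of filling each leaf to capacity and risking an underfull last leaf, you can pre-compute an even distribution. This is a sketch with made-up names (`distribute`, `capacity`), not Mavibot's actual code:

```java
public class LeafDistribution {
    /**
     * Given the total number of sorted elements and the leaf capacity,
     * pre-compute how many elements each leaf should receive so that
     * every leaf, including the last one, holds at least capacity/2 keys.
     */
    public static int[] distribute(int nbElements, int capacity) {
        if (nbElements == 0) {
            return new int[0];
        }
        int nbLeaves = (nbElements + capacity - 1) / capacity; // ceiling division
        int base = nbElements / nbLeaves;
        int extra = nbElements % nbLeaves; // the first 'extra' leaves get one more
        int[] counts = new int[nbLeaves];
        for (int i = 0; i < nbLeaves; i++) {
            counts[i] = base + (i < extra ? 1 : 0);
        }
        return counts;
    }

    public static void main(String[] args) {
        // 10 elements with capacity 4: naive filling gives leaves of 4, 4, 2
        // (the last leaf underflows); spreading evenly gives 4, 3, 3.
        System.out.println(java.util.Arrays.toString(distribute(10, 4))); // prints [4, 3, 3]
    }
}
```

The same calculation can then be applied recursively to each upper layer of the tree, since the number of leaves (and hence of keys per level) is also known in advance.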
> There are a few things that need some love still :
> - the entrySerializer has to be moved from JDBM to the Mavibot Partition
can we move this to an xdbm-partition module?

> - the Mavibot partition builder has to be rewritten using the Mavibot
> bulkloader (but that should be easy, as we just have to call the
> BulkLoader primitive)
> - the In-memory bulk load has to be implemented
> - I'd like to implement a compact() method that takes an existing BTree
> and compacts it on the fly
hmm, how about applying this at the entire database level rather than to just
an individual BTree?
This would help us write all the data to a new file when the existing file
has too many free pages or sparse nodes.

> - we have to deal with the RecordManager, because atm we can't have it
> opened while processing a btree. It could be interesting to be able to
> compact a btree while the RM is up and running (assuming no write
> operation is done in the meantime).
right, this is why I think compacting should be applied to the entire database
instead of a single BTree.

otoh, the bulkloader will be quite handy here while compacting: all we need to
do is pass it the relevant cursors of the existing BTrees.
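The idea of database-level compaction via the bulkloader could be sketched roughly as follows. Every name here is hypothetical (a BTree is modelled as a plain sorted map, `compactAll` is invented for illustration); this is not Mavibot's real API:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

public class CompactionSketch {
    /**
     * Hypothetical database-level compaction: each named BTree (modelled here
     * as a sorted map) is streamed in key order into a fresh file, leaving the
     * free pages and sparse nodes of the old file behind.
     */
    static Map<String, NavigableMap<String, String>> compactAll(
            Map<String, NavigableMap<String, String>> oldFile) {
        Map<String, NavigableMap<String, String>> newFile = new LinkedHashMap<>();
        for (Map.Entry<String, NavigableMap<String, String>> tree : oldFile.entrySet()) {
            // A cursor over an existing BTree already yields sorted entries, so
            // the bulk loader can rebuild the tree with fully packed, balanced
            // pages instead of replaying individual inserts.
            newFile.put(tree.getKey(), new TreeMap<>(tree.getValue()));
        }
        return newFile;
    }

    public static void main(String[] args) {
        Map<String, NavigableMap<String, String>> db = new LinkedHashMap<>();
        db.put("ou=system", new TreeMap<>(Map.of("a", "1", "b", "2")));
        System.out.println(compactAll(db)); // same live data, freshly packed
    }
}
```

The key point the sketch captures is that each BTree's cursor already produces sorted data, so compaction skips the bulkloader's sort phase entirely and only runs the build phase.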

> Here is where we are atm. More to come in the next days !

Kiran Ayyagari
