Hi guys,
Kiran just committed a first shot for the Mavibot bulk load feature.
It's working, at least for a limited set of data (we tested it with 30
000 entries generated with MakeLDIF).
There are still a lot to do :
- when trying to load more than 30 K entries, we get OOM. Growing the
memory is just not the right approach here
- sub-btrees are not bulk-loaded, resulting to a long time to build
indexes with duplicate values (like ObjectClass)
- the process will parse each entry twice
- some code cleanup is needed...
So for, the biggest issue is the first one. Every other are just minor
(and I think Kiran is already working on them).
The pb is that we won't be able to store all the read entries in memory,
so we have to pre-process them in some ways. Some constraints have to be
taken into account :
- when we inject an entry into the MasterTable, we have to add the
entryCSN and entryUUID if they aren't existing
- we also have to add the parentID in each entry, which means we have to
know which is the parent
- the RDN index is also built in a specific order
- that means we have to order the entries
- but we also have to order every indexed attribute if we want to build
the associated b-trees
Most of those issues would be easy to deal with if we can sort the
various elements before starting to build the b-trees.
One idea is to parse the entries from X to X+N, where N is an arbitrary
limit (it can be adapative), and create sorted files for each of the
elements (ie entries, and indexes). Once all the files created (and we
may have many), we wil just have to process them by reading the first
entry of each file, and pick the smallest one to inject it into the
associated btree. For instance :
data = G P D T N O S R B Y K
we can just deal with 4 of them in memory, so we produce 3 sorted files :
f1 : D G P T
f2 : N O R S
f3 : B K Y
Now, we can create the index using the 3 files in parallel :
1) min(f1, f2, f3) -> f3 B, f3{K Y}
2) min(f1, f2, f3) -> f1 D, f1{G P T}
3) min(f1, f2, f3) -> f1 G, f1{P T}
4) min(f1, f2, f3) -> f3 K, f3{Y}
5) min(f1, f2, f3) -> f2 N, f2{O R S}
6) min(f1, f2, f3) -> f2 O, f2{R S}
7) min(f1, f2, f3) -> f1 P, f1{T}
8) min(f1, f2, f3) -> f2 R, f2{S}
9) min(f1, f2, f3) -> f2 S, f2{}
10) min(f1, f2, f3) -> f1 T, f1{}
11) min(f1, f2, f3) -> f3 Y, f3{}
As you can see, we are pulling the data in the right order from each of
the file, to generate a sorted index.
Of course, it has to be done for each index, which means we will have
many sets of files (one set per index).
All in all, I don't see any better way, but this is why I want your input !