directory-dev mailing list archives

From Howard Chu <...@symas.com>
Subject Re: Bulk Loader : some ideas...
Date Sun, 17 Aug 2014 15:07:56 GMT
Emmanuel Lécharny wrote:
> Hi guys,
>
> sorry for having been absent the last few weeks (side project ...).
>
> I have spent part of the last two days reviewing the bulkLoader tool.
> Currently, on my machine, I'm able to process 30 000 entries in less
> than 20 seconds (around 1600 added entries per second). That's fast
> compared to direct injection of entries into a running LDAP server,
> but we can do way better.
>
> Here are some thoughts :
>
> - first of all, we need to process the DNs before being able to inject
> entries into the master table. This is required because we have to
> inject the ParentID attribute into each entry, which requires that we
> have the complete hierarchy available. Here, we have two options :
>
> 1) read the entry's DN from the LDIF file first, ignoring the values. We
> have a FastLdifReader class that reads the DN and skips over the other
> elements. We assume that we have enough memory to hold all the DNs we
> will read (which could be a limitation when we try to bulk load tens of
> millions of entries). Once done, we can associate an ID with each RDN
> and get ready to inject those IDs into the entries, which will be done
> while creating the MasterTable. What if we don't have enough memory ?
> Well, see option 2.
>
> 2) As we may not have enough memory to hold all the DNs, we have to do
> an external sort. I.e., we load a limited number of DNs, and once we
> have reached the limit, we sort what we have and save the sorted result
> on disk. Once we have gone through all the DNs, we end up with N files
> that are each individually sorted but still need to be merged. This is
> done in a second phase, by repeatedly fetching the smallest DN from
> among the sorted files, until we have no more DNs to read from any file.
>
> The second option will obviously be more costly, but it's guaranteed to
> work no matter how much memory we have. And if we are lucky enough to
> have enough memory to store all the elements in memory, we fall back to
> the first option.
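>
> A rough sketch of this external sort, with illustrative names (a real
> implementation would compare DNs hierarchically, parent before child,
> rather than as plain strings) :
>
> import java.io.*;
> import java.nio.file.*;
> import java.util.*;
>
> public class ExternalDnSort {
>     // Phase 1 : read DNs, sort them in chunks of 'limit', and write
>     // each sorted chunk (a "run") to a temporary file.
>     static List<Path> writeSortedRuns(BufferedReader dns, int limit) throws IOException {
>         List<Path> runs = new ArrayList<>();
>         List<String> chunk = new ArrayList<>(limit);
>         String dn;
>         while ((dn = dns.readLine()) != null) {
>             chunk.add(dn);
>             if (chunk.size() == limit) runs.add(flushRun(chunk));
>         }
>         if (!chunk.isEmpty()) runs.add(flushRun(chunk));
>         return runs;
>     }
>
>     static Path flushRun(List<String> chunk) throws IOException {
>         Collections.sort(chunk);                 // in-memory sort of one run
>         Path run = Files.createTempFile("dns", ".run");
>         Files.write(run, chunk);
>         chunk.clear();
>         return run;
>     }
>
>     // Phase 2 : k-way merge, always emitting the smallest DN among the
>     // heads of the sorted runs.
>     static void merge(List<Path> runs, BufferedWriter out) throws IOException {
>         PriorityQueue<Cursor> heap =
>             new PriorityQueue<>(Comparator.comparing((Cursor c) -> c.dn));
>         for (Path run : runs) {
>             Cursor c = new Cursor(Files.newBufferedReader(run));
>             if (c.advance()) heap.add(c);
>         }
>         while (!heap.isEmpty()) {
>             Cursor c = heap.poll();
>             out.write(c.dn);
>             out.newLine();
>             if (c.advance()) heap.add(c);        // refill from the same run
>         }
>         out.flush();
>     }
>
>     static final class Cursor {
>         final BufferedReader in;
>         String dn;
>         Cursor(BufferedReader in) { this.in = in; }
>         boolean advance() throws IOException { return (dn = in.readLine()) != null; }
>     }
> }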
>
> What we keep in memory is not only the DN, but also the entryUUID, so
> that we can store it into each entry later (the parent's entryUUID
> becomes the parentId attribute of its children).
>
> At the same time, we need to gather all the entryUUIDs, as we need them
> sorted to be able to create the masterTable (we have the same constraint
> as for DNs, i.e., if we don't have enough memory, then we need to create
> temporary files). We also keep a position in the original file to be
> able to fetch the entries directly.
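>
> As a sketch, the per-entry bookkeeping record could look like this
> (the names are illustrative, not the actual ApacheDS classes) :
>
> // One record per entry from the first pass : the DN, its entryUUID,
> // and the byte offset of the entry in the LDIF file, so we can fetch
> // the full entry later without re-reading everything.
> final class EntryRef {
>     final String dn;
>     final java.util.UUID entryUuid;
>     final long ldifOffset;
>
>     EntryRef(String dn, java.util.UUID entryUuid, long ldifOffset) {
>         this.dn = dn;
>         this.entryUuid = entryUuid;
>         this.ldifOffset = ldifOffset;
>     }
> }
>
> // given List<EntryRef> refs : keep one copy in hierarchical DN order,
> // and sort another copy by entryUUID for the masterTable build :
> refs.sort(java.util.Comparator.comparing(r -> r.entryUuid));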
>
> So, bottom line, at the end of this first phase, we will have a sorted
> list of UUIDs and a hierarchically sorted set of DNs. If we are lucky,
> they are in memory, and if we are very lucky (or if we don't have a lot
> of entries to process), even the entries will be in memory.
>
> - once this first phase is done, we can build the MasterTable : we just
> have to walk the sorted list of entryUUIDs and create the masterTable
> from it. It's just a matter of fetching each entry from disk using its
> position, parsing it, and storing it once we have added the parentId and
> the other required system attributes (entryCSN, creatorsName, and
> creationTime).
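>
> Sketched out, this second phase could be (the helper and method names
> here are hypothetical, not the actual ApacheDS API) :
>
> for (EntryRef ref : refsSortedByUuid) {
>     // seek to the saved position and parse just that one entry
>     Entry entry = fetchEntry(ldifChannel, ref.ldifOffset);
>     entry.add("parentId", parentUuidOf(ref.dn).toString());
>     entry.add("entryCSN", newCsn());
>     entry.add("creatorsName", adminDn);
>     entry.add("creationTime", nowGeneralizedTime());
>     // keys arrive in sorted order, so the table is built sequentially
>     masterTable.put(ref.entryUuid, entry);
> }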
>
> - at the very same time, we need to create the other indexes. For the
> same reason (memory limitation), we may have to go through intermediate
> files and do an external sort. In any case, we can delegate those index
> creations to dedicated threads, to benefit from the multiple cores a
> computer has, speeding up the processing. The algorithm is no different
> from the one we used to create the master table : we store in memory
> each value associated with the entryUUID it relates to, and once done,
> we push them into a new Table.
>
> Side note though : for indexes, we may have multiple entryUUIDs
> associated with a single value (the typical case being the ObjectClass
> index).
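>
> For instance (hypothetical names again ; imports from java.util.* and
> java.util.concurrent.*) :
>
> ExecutorService pool =
>     Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
> for (String attr : indexedAttributes) {
>     pool.submit(() -> {
>         // sorted map of value -> entryUUIDs ; one value may point to
>         // many entries (e.g. every entry sharing an objectClass value)
>         NavigableMap<String, Set<UUID>> index = new TreeMap<>();
>         for (EntryRef ref : refs) {
>             for (String value : valuesOf(ref, attr)) {
>                 index.computeIfAbsent(value, v -> new HashSet<>()).add(ref.entryUuid);
>             }
>         }
>         writeIndexTable(attr, index);  // or spill to temp files + external sort
>     });
> }
> pool.shutdown();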
>
> I'm not sure we can do much better to improve the performance. Using a
> cache for the MasterTable should not be useful if we can't store all
> the entries in memory.
>
> Thoughts ?
>
Using two different approaches is gratuitous complexity, and multiple passes 
with external sorting are unnecessary.

UUIDs are generally not a sortable value (or at least sorting them is not 
meaningful), so that seems like a lot of wasted effort.

I explained OpenLDAP's approach once before but will summarize it again:

It's a single-pass loader, and it works regardless of the entry order of the 
input LDIF. For each entry, we examine its DN and recursively look up all of 
its parent DNs in the RDN index, from the top down. For any RDN missing from 
this lookup, we create the RDN index entry anyway and generate an entryID for 
it at that time. We also keep an in-memory list of these missing DNs.

If we encounter an entry later in the LDIF that corresponds to one of these 
missing DNs, the search in the RDN index will just return the entryID we 
already assigned to it. We then remove the DN from the missing DN list. The 
result is that the DB tables and entryIDs are generated in DN order even if 
the entries aren't ordered in the LDIF.

At the end of the run, any slots left in the missing DN list are printed out 
in a warning message. You can still use either the bulk loader or ldapadd to 
supply these entries later.

E.g., given an LDIF with these entries:

dn: dc=example,dc=com

dn: cn=joe,ou=people,dc=example,dc=com

dn: cn=foo,ou=groups,dc=example,dc=com

dn: ou=groups,dc=example,dc=com


dc=example,dc=com gets entryID 1.

cn=joe,ou=people,dc=example,dc=com causes ou=people,dc=example,dc=com to be 
indexed with entryID 2, marked missing. cn=joe gets entryID 3.

cn=foo,ou=groups,dc=example,dc=com causes ou=groups,dc=example,dc=com to be 
indexed with entryID 4, marked missing. cn=foo gets entryID 5.

ou=groups,dc=example,dc=com finds that it already has entryID 4 and gets 
removed from the missing DN list.

ou=people,dc=example,dc=com is printed as missing. Its entryID remains 
reserved for it and will be used when it gets properly added.

There's no redundant parsing of the LDIF required, no multiple passes of 
processing. You simply use the RDN index that you're already generating to 
keep the entry DNs properly sorted. There's no need for external sorts and no 
need to worry about what fits in memory (aside from the missing DN list, which 
you could also maintain as an external table if you really wanted to, but in a 
proper LDIF there should not be very many DNs in this list anyway).
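
As a rough Java sketch of the scheme (slapadd itself is C; all names here are 
illustrative, and a real implementation would use a proper DN parser rather 
than splitting on commas):

import java.util.*;

public class SinglePassLoad {
    static final String SUFFIX = "dc=example,dc=com"; // database suffix = one entry
    static final Map<String, Long> rdnIndex = new HashMap<>();      // (parentId, rdn) -> entryID
    static final Map<String, Long> missing  = new LinkedHashMap<>(); // DN -> reserved entryID
    static long nextId = 1;

    // Walk a DN from the suffix down, creating an RDN index slot (and an
    // entryID) for any component that does not exist yet.
    static long idForDn(String dn) {
        long parentId = 0;
        String dnSoFar = "";
        for (String rdn : componentsTopDown(dn)) {
            dnSoFar = dnSoFar.isEmpty() ? rdn : rdn + "," + dnSoFar;
            String key = parentId + "|" + rdn;
            Long id = rdnIndex.get(key);
            if (id == null) {                     // absent parent, or the entry itself
                id = nextId++;
                rdnIndex.put(key, id);
                missing.put(dnSoFar, id);         // reserved, not yet loaded
            }
            parentId = id;
        }
        return parentId;                          // entryID of dn itself
    }

    static List<String> componentsTopDown(String dn) {
        List<String> comps = new ArrayList<>();
        comps.add(SUFFIX);
        if (!dn.equals(SUFFIX)) {
            String[] rdns = dn.substring(0, dn.length() - SUFFIX.length() - 1).split(",");
            for (int i = rdns.length - 1; i >= 0; i--) comps.add(rdns[i]);
        }
        return comps;
    }

    public static void main(String[] args) {
        String[] ldifDns = {                      // same order as the example above
            "dc=example,dc=com",
            "cn=joe,ou=people,dc=example,dc=com",
            "cn=foo,ou=groups,dc=example,dc=com",
            "ou=groups,dc=example,dc=com",
        };
        for (String dn : ldifDns) {
            long id = idForDn(dn);
            missing.remove(dn);                   // this entry really exists now
            System.out.println("stored " + dn + " as entryID " + id);
        }
        // Whatever is left was referenced as a parent but never supplied.
        missing.forEach((dn, id) ->
            System.out.println("WARNING: missing " + dn + " (entryID " + id + " reserved)"));
    }
}

Running it reproduces the trace above: ou=people,dc=example,dc=com ends up in 
the final warning, with its reserved entryID 2 still held for it.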

-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/
