directory-dev mailing list archives

From Emmanuel Lecharny <elecha...@iktek.com>
Subject Re: Streaming / Serializing Big Objects
Date Fri, 08 Sep 2006 21:52:04 GMT
Sorry, it was supposed to be sent to the ML, but I forgot to do a Reply-All.

So here is my answer to Ole:

Some thoughts and enlightenment:

Ole Ersoy wrote:

> Cool -
> OK suppose we had a StateManager.
>
> The StateManager has a decode method on it that reads a persistent file
> and creates the directory tree.
>  
>
The directory tree is totally different. We have many files, including a 
Master table which contains the entries, and other files which store 
the indices. This is a choice that can be discussed, but basically, we 
*never* read the files at startup (remember that we could have millions of 
entries). The cache system (which could perfectly well be something like 
Hibernate, Prevayler, or whatever persistent cache) is loaded on the 
fly. However, just keep in mind that an LDAP server is not intended to be 
stopped very often!

> The StateManager's encode method uses a list of
> references to directory tree objects
> creating a concatenated String of the string
> representation of all these objects, and then writes
> the string to a file, once all the concatenation is
> done.
>  
>
Objects stored in the Master table (let's call them entries) have this 
structure:
Entry:
- DistinguishedName (which basically contains two strings), the unique key
- attributes, which are a list of:
  - attribute, which is: a name, plus a list of:
    - values (byte[] or String, or - and this is what we are talking 
about - a reference to persisted data)
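
Just to make it concrete, here is what that could look like in Java - the 
names (Entry, Value, persistedDataReference) are purely illustrative, not 
the actual ADS classes:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of the entry structure described above. The names are
// illustrative, not the real ADS classes.
class Entry {
    // The unique key: a DistinguishedName, basically two strings
    // (the user-provided form and the normalized form).
    String userProvidedDn;
    String normalizedDn;

    // Attributes: an attribute name mapped to its list of values.
    Map<String, List<Value>> attributes = new HashMap<String, List<Value>>();
}

class Value {
    // A value is either held in memory as a byte[] or a String ...
    byte[] bytes;
    String string;

    // ... or it is a reference to data persisted somewhere else
    // (a file name, a key to a blob in a database, etc.).
    String persistedDataReference;
}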

I don't really see what a StateManager can bring here. What we just need 
to do is to store an attribute value somewhere, and be able to send it 
back to the user, limiting the memory footprint needed to do so to a 
minimal value (let's say 1024 bytes, for instance). What we store in the 
entry is just a reference to this persisted data - be it a file name, a 
key to a blob in a database, or a mail on Google, if we create 10000 
Gmail accounts to be able to store 2 TB of data for free :) ...
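
A minimal sketch of what I mean, assuming the reference is simply a file 
name and the limit is 1024 bytes (not real backend code):

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class ValueStreamer {
    // The maximal amount of the value kept in memory at any time.
    private static final int BUFFER_SIZE = 1024;

    // Streams a persisted attribute value to the client, never holding more
    // than BUFFER_SIZE bytes of it in memory. The reference is assumed to be
    // a plain file name here; it could just as well be a key to a blob.
    public static void streamValue(String persistedDataReference, OutputStream out)
            throws IOException {
        InputStream in = new FileInputStream(persistedDataReference);
        byte[] buffer = new byte[BUFFER_SIZE];

        try {
            int read;

            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }

            out.flush();
        } finally {
            in.close();
        }
    }
}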

> Am I getting any warmer?
>  
>
I can't say. But maybe my explanations are not clear enough :)

> I read a little about Prevayler.  It just serializes
> all the Java objects that need to be persisted
> immediately as it becomes aware of them, I think, and
> then keeps them updated as the objects mutate.
>
We are not really willing to store Java objects, but byte[] or Strings. 
I know, technically speaking, they are objects :), but they can also be 
seen as streams of bytes, which they are, after all!

>  So if
> the application crashes, on reboot it will read the
> persistent files and be back up.
>
I hope that the backend will be reliable! At the moment, there is 
nothing really done to ensure that we can't lose data if we brutally 
stop the server, except a flag which forces a 'synch-on-write' when 
modifying the data. But we may have problems, because we don't support 
transactions. We need to support transactions, and a kind of shadow 
pages mechanism, à la RDBMS. Still a work in start (can't say work in 
progress when we just have a few JIRAs and Confluence pages about it :)
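
Just to illustrate what such a flag amounts to at the file level (a rough 
sketch, not the actual backend code, and assuming the writes go through a 
FileChannel):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

// Rough illustration of a 'synch-on-write' flag: when it is set, every write
// is forced out of the OS cache before the call returns, trading speed for a
// smaller window of data loss when the server is brutally stopped.
public class SyncOnWriteWriter {
    private final FileChannel channel;
    private final boolean syncOnWrite;

    public SyncOnWriteWriter(RandomAccessFile file, boolean syncOnWrite) {
        this.channel = file.getChannel();
        this.syncOnWrite = syncOnWrite;
    }

    public void write(byte[] data) throws IOException {
        channel.write(ByteBuffer.wrap(data));

        if (syncOnWrite) {
            // Push the data (and file metadata) onto the disk right now.
            channel.force(true);
        }
    }
}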

>  To make reboot more
> efficient, the persistent files can be managed like I
> described above with the StateManager on a clean
> shutdown, which I think is what you are describing.
>  
>
If the server is shut down cleanly, the database is supposed to *always* 
be in a correct state. And again, when starting the server, we don't load 
any data, except management data (like the Root DSE). We may store the 
cache and reload it to improve the 'warm up' process, which seems a really 
good idea, but as I said, shutting down an LDAP server should not occur 
very frequently ...

> The reason I mention this is because as the directory
> tree mutates, we would not want to persist the entire
> tree per mutation right?  So we would have to either
> use relational persistence, or write a single file
> just containing the mutation.   
>
If we 'mutate' the directory tree, the cache should be updated 
accordingly. Basically, if we consider the existing server, what we 
do is remove the modified data from the cache. Saving a copy of the 
tree before its mutation is not an option. It is far better 
to modify a copy of this sub-tree, and when the modification is done, 
switch the old tree and the new tree. But this is really not easy, 
as we are not storing the tree as a whole, but many trees (one per index 
file) plus a whole bunch of entries in the master database. This is not 
a simple matter, and it's difficult to explain, too... There is a kind 
of explanation here:
http://docs.safehaus.org/display/APACHEDS/Backend
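
To give a rough idea of the 'modify a copy, then switch' part, here is a 
sketch on a single in-memory tree (the names are invented, and the real 
backend is much more involved, as said above):

import java.util.concurrent.atomic.AtomicReference;

// Sketch of the 'modify a copy, then switch' idea, on a single in-memory
// tree. Readers always see either the old tree or the new one, never a
// half-modified one. The real backend has one tree per index plus the
// master table, so it is much more involved than this.
public class CopyOnWriteTree<T> {
    private final AtomicReference<T> root = new AtomicReference<T>();

    public T currentRoot() {
        return root.get();
    }

    // The mutator builds a modified *copy* of the current root; the swap is
    // retried if another writer switched the tree in the meantime.
    public void mutate(TreeMutator<T> mutator) {
        while (true) {
            T oldRoot = root.get();
            T newRoot = mutator.modifyCopyOf(oldRoot);

            if (root.compareAndSet(oldRoot, newRoot)) {
                return; // the switch itself is atomic
            }
        }
    }

    public interface TreeMutator<T> {
        T modifyCopyOf(T oldRoot);
    }
}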

> That would mean we are in more of an rsync like mode,
> where if the server crashes, we load the original
> directory tree file + any mutation files.
>  
>
Yeah, there is definitely something to dig into around this idea. I like 
that :) This is what OpenLDAP does with its journal (log files).

> If the directory shuts down cleanly we encode all the
> directory objects to one file and delete all the
> "temporary" mutation files.
>  
>
If the server shuts down cleanly, I think that nothing should be done. If 
you consider that you have a kind of journal/shadow page that contains 
all the not-yet-applied modifications, then the last thing that the 
server should do is wait until those modifications are done. When a 
modification is done, the corresponding journal/shadow page should 
be removed (or marked as applied). If we have a problem, then, with 
transaction support, we may be able to roll back. A lot of work ...
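
A very small sketch of what such a journal could look like (the format and 
the names are made up, just to fix ideas):

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

// Tiny sketch of the journal idea: a modification is appended to the log
// *before* it touches the master table or the index files. On a clean
// shutdown the server waits until the pending modifications are applied and
// truncates the journal; after a crash, the journal is replayed.
public class Journal {
    private final File journalFile;
    private final BufferedWriter writer;

    public Journal(File journalFile) throws IOException {
        this.journalFile = journalFile;
        this.writer = new BufferedWriter(new FileWriter(journalFile, true));
    }

    // Called before the modification is applied to the database files.
    public void logModification(String serializedModification) throws IOException {
        writer.write(serializedModification);
        writer.newLine();
        writer.flush();
    }

    // Called once every logged modification has been applied (clean shutdown).
    public void markAllApplied() throws IOException {
        writer.close();

        // Truncate the journal: nothing is left to replay at the next startup.
        new FileWriter(journalFile, false).close();
    }
}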

> Incidentally EMF can be used for any type of
> serialization, a concatenated file like the one I just
> described, XML, relational persistence, etc.  One of
> the benefits of EMF is that if for whatever reason
> someone wanted to serialize to XML, implementing a
> function to do so would be very straightforward.  If
> someone wanted to serialize to a relational source,
> that's easy too.
>  
>
We definitely have to implement an RDBMS backend. RDBMSs offer all those 
mechanisms for free, no need to be hit by the NIH syndrome :) And we also 
have to remember that we are *not* writing Derby, but ADS!

> There's also the EMF Technology project's Object
> Constraint Language, which can be used to query the EMF
> model... and I would think it would be very useful for
> creating directory-like queries and coding the query
> API.
>  
>
Well, a little bit too far for my little brain ...

> There's an article on the eclipse site just written on
> how to use it.
>  
>
Good. Let's experiment. Talking is good, reading is great, but writing is 
better and implementing a solution is a must!

> Cheers,
> - Ole
>  
>

Emmanuel

