directory-dev mailing list archives

From Trustin Lee <trus...@gmail.com>
Subject Re: [jira] Created: (DIREVE-170) Standarzied serialization and deserialization of Name, Attribute, and Attributes.
Date Tue, 28 Jun 2005 15:39:02 GMT
Hi,

2005/6/28, Niclas Hedhman <niclas@hedhman.org>: 
> 
> On Tuesday 28 June 2005 08:23, Trustin Lee wrote:
> 
> > The biggest problem is the class descriptors written by
> > ObjectOutputStream. It is sometimes even bigger than the actual object
> > data. We can override some protected methods to store the descriptors
> > somewhere else, but that makes the serialized data dependent on the
> > descriptor database.
> > I even saw a case where an SMS message object serialized to 2 kB of data
> > because its class descriptor took up 1.4 kB.
> 
> Hmmmm... What tests have you actually run?
> You can't do without the FQ classnames of the classes involved. They are
> written in 'clear text' once for each class, then referenced with an index
> (int IIRC). Whether or not you need the field names is your call, but it
> sounds like a decent system to not depend on knowing the exact ordering.
> The codebase URLs are the third item written out, and those of course can
> be very large.
> 
> import java.io.*;
> 
> public class Test
> {
>     static public void main( String[] args )
>         throws Exception
>     {
>         FileOutputStream fos = new FileOutputStream( "abc.ser" );
>         ObjectOutputStream oos = new ObjectOutputStream( fos );
>         Abc abc = new Abc();
>         oos.writeObject( abc );
>         oos.close();
>     }
> 
>     private static class Abc implements Serializable
>     {
>         String abc = "1";
>         String def = "2";
>     }
> }
> 
> Typical case??? Well, it results in 75 bytes.

 Yes, 75 bytes for only two single-character strings is huge. :)
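 (If anyone wants to reproduce the number without checking the file size, a 
ByteArrayOutputStream variant of the test above works; note the exact count 
depends on the class and field names:)

```java
import java.io.*;

public class SizeTest {
    // Same shape as the Abc class in the test above.
    private static class Abc implements Serializable {
        String abc = "1";
        String def = "2";
    }

    public static void main(String[] args) throws Exception {
        // Serialize into memory so the size can be read directly.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(bos);
        oos.writeObject(new Abc());
        oos.close();
        System.out.println(bos.size());  // stream header + class descriptor + data
    }
}
```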

> What if the name of class changes?
> 
> I assume this is a rhetorical question, since I am sure you know the
> answer. I am interested to know how you are going to handle that in your
> own serialization framework.

 We don't specify the type name, because we know what type will come in the 
stream.
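 A minimal sketch of what I mean, using plain DataOutputStream (the DN value 
here is just an example): since both sides already agree that a single UTF 
string comes next, nothing but the value itself goes into the stream.

```java
import java.io.*;

// Sketch: schema-implied encoding. Reader and writer agree on the
// layout, so no class name or field metadata is written at all.
public class SchemaImplied {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeUTF("cn=Trustin Lee,ou=people");  // value only
        out.close();

        DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(bos.toByteArray()));
        String name = in.readUTF();  // the reader knows a UTF string comes next
        System.out.println(name + " / " + bos.size() + " bytes");
    }
}
```

 Here the whole record is a 2-byte length plus the 24 value bytes: 26 bytes, 
with no descriptor at all.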

> And if we implement readObject and
> > writeObject by ourselves, why do we use ObjectOutputStream?
> 
> Because you don't need to worry about complex classes, and diving into the
> hierarchies of instances, which you would for both "rolling your own" as
> well as Externalizable.

 Right. I thought Attributes, Attribute, and Name were simple enough that we 
could forget about complex object graphs. But attribute values should be able 
to contain arbitrary Java objects, so I'm thinking about allowing Java object 
serialization only there.

> Moreover, it
> > adds extra metadata that indicates each field's type that increases the
> > size of serialized data. If we implement readObject and writeObject
> > manually, there's no need to include those metadata IMHO.
> 
> Serialization writes the field names to the stream, so that it can restore
> the fields even if they were re-ordered in the class. I think you have
> observed that when you use writeObject(), the field names are still written
> to the stream. I don't know the answer to that, since the deserialization
> can not possibly know what to do with it.

 You're right.
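 For what it's worth, an Externalizable class avoids the per-field metadata 
because it writes its values by hand; only the class descriptor itself is 
still written. A sketch (the Attr class here is hypothetical):

```java
import java.io.*;

// Sketch: an Externalizable class controls its own wire format, so no
// field names or per-field type info end up in the stream (the class
// descriptor with the class name is still written once).
public class ExternalizableDemo {
    public static class Attr implements Externalizable {
        String id = "";
        String value = "";

        public Attr() {}  // public no-arg constructor required by Externalizable
        Attr(String id, String value) { this.id = id; this.value = value; }

        public void writeExternal(ObjectOutput out) throws IOException {
            out.writeUTF(id);     // values only, no field names
            out.writeUTF(value);
        }

        public void readExternal(ObjectInput in) throws IOException {
            id = in.readUTF();    // must read back in the same order
            value = in.readUTF();
        }
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(bos);
        oos.writeObject(new Attr("cn", "Trustin Lee"));
        oos.close();

        ObjectInputStream ois = new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray()));
        Attr a = (Attr) ois.readObject();
        System.out.println(a.id + "=" + a.value);
    }
}
```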
 
> > My aim is to create a compact and fast codec for LDAP-specific entities
> > (LdapName, Attribute, Attributes) that is Java-independent, so that it
> > can be used to create another protocol based on ApacheDS or to store
> > data in a Java-independent way.
> 
> If they are flat, i.e. basically strings or collections of strings, then I
> agree that serialization does not necessarily add any value. But are you
> not allowed to store any arbitrary Object in attributes?

 Attribute values can actually be any Java objects, so I'm going to use 
object serialization only for that case. But the most often used types, such 
as String and byte[], will have to be handled specially to gain maximum 
performance.
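 Roughly what I have in mind for the value codec (a sketch only; the tag 
values and method names are made up): a one-byte tag selects the common 
String and byte[] paths, and full object serialization becomes the fallback.

```java
import java.io.*;

// Hypothetical sketch: tag-based encoding for attribute values,
// special-casing the common String and byte[] cases and falling back
// to full object serialization only for other types.
public class ValueCodec {
    private static final byte STRING = 0, BYTES = 1, OBJECT = 2;

    public static void encode(Object value, DataOutputStream out) throws IOException {
        if (value instanceof String) {
            out.writeByte(STRING);
            out.writeUTF((String) value);
        } else if (value instanceof byte[]) {
            byte[] b = (byte[]) value;
            out.writeByte(BYTES);
            out.writeInt(b.length);
            out.write(b);
        } else {
            out.writeByte(OBJECT);          // rare path: pay the full cost
            ObjectOutputStream oos = new ObjectOutputStream(out);
            oos.writeObject(value);
            oos.flush();
        }
    }

    public static Object decode(DataInputStream in)
            throws IOException, ClassNotFoundException {
        switch (in.readByte()) {
            case STRING: return in.readUTF();
            case BYTES:
                byte[] b = new byte[in.readInt()];
                in.readFully(b);
                return b;
            default: return new ObjectInputStream(in).readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        encode("uid=trustin", new DataOutputStream(bos));
        DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(bos.toByteArray()));
        System.out.println(decode(in));
    }
}
```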
 LDAP entries are usually stored in B+Tree implementations, so we have to 
initialize a new ObjectInputStream or ObjectOutputStream each time we read or 
write an object. This is a major performance penalty, because it usually 
costs additional memory allocation and copying, and it causes the class 
descriptors to be written again and again (in a regular stream this is not a 
problem because they are reused, but it becomes one in an environment like 
this). Plus, the size of an entry impacts the performance of the backing 
storage when massive operations are performed. Making the serialized data 
smaller gives a performance gain because it lets the database fit more items 
per page.
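 The overhead is easy to see: serializing two entries with a fresh 
ObjectOutputStream each time (as a B+Tree record store effectively does) 
costs noticeably more than writing them to one shared stream, because the 
stream header and class descriptor are repeated. A sketch (the Entry class 
is just an example):

```java
import java.io.*;

// Sketch of the per-record overhead: a B+Tree stores each entry as an
// independent byte[], so every record carries its own stream header and
// class descriptor, instead of reusing them as a long-lived stream would.
public class PerRecordOverhead {
    private static class Entry implements Serializable {
        String dn = "cn=example";
    }

    static int freshStreamSize() throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(bos);
        oos.writeObject(new Entry());  // descriptor written for this record alone
        oos.close();
        return bos.size();
    }

    public static void main(String[] args) throws Exception {
        // Two records, fresh stream each time (the B+Tree case).
        int perRecord = freshStreamSize() + freshStreamSize();

        // Two records on one shared stream: descriptor written once,
        // then back-referenced.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(bos);
        oos.writeObject(new Entry());
        oos.writeObject(new Entry());
        oos.close();

        System.out.println(perRecord + " " + bos.size());
    }
}
```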
 If performance were not a problem, we could just go with object 
serialization, but currently our performance is not really good, and that is 
caused by the large extra I/O from object serialization.
 Trustin
-- 
what we call human nature is actually human habit
--
http://gleamynode.net/
