cocoon-dev mailing list archives

From Stefano Mazzocchi <>
Subject Re: XML Compilation
Date Wed, 18 Oct 2000 10:37:12 GMT
Sylvain Wallez wrote:
> Stefano Mazzocchi wrote:
> >
> > Sylvain Wallez wrote:
> > >
> > > Great, great !
> > >
> > > A few suggestions to make the format more compact without going into
> > > complicated compression algorithms :
> > >
> > > - using a byte instead of an int would divide SAX instruction code
> > > size by 4.
> >
> > I'm already using a byte :)
> >
> >  OutputStream.write(int c);
> >
> > already discards the upper 24 bits (read the javadoc to find out)
> >
> Ah! I forgot about that... I always found it strange to use an int
> parameter to write a byte (speed concerns ?).

Bah, probably... it removes the int->byte conversion since most people
use ints... but I agree those interfaces are pretty crappy about
raw data I/O. (and this is why I rewrote my own with specific semantics)
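
For the archives, a minimal sketch of the write(int) behaviour mentioned above. The class and method names (WriteIntDemo, writeAll) are made up for illustration and are not part of the XMLCompiler code; only the standard java.io classes are real.

```java
import java.io.ByteArrayOutputStream;

class WriteIntDemo {
    // Hypothetical helper: passes each int to write(int) and returns
    // the bytes that actually end up in the stream.
    static byte[] writeAll(int... values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int v : values) {
            out.write(v); // only the low-order 8 bits are kept
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // prints 41, 78 and ff, one per line
        for (byte b : writeAll(0x41, 0x12345678, -1)) {
            System.out.printf("%02x%n", b & 0xFF);
        }
    }
}
```

So an int opcode passed to write(int) costs one byte on the stream, as long as its value fits in 0..255.
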
> > > - repetition of elements name and namespace can be avoided for
> > > "endElement" : XMLInterpreter can hold a stack of open elements to
> > > retrieve these values.
> >
> > hmmm, good point, I'll try that and see if it's worth it... during
> > development I found out that not all optimizations end up being such...
> > for example, increasing the buffer size from 8 KB to 16 KB slows things
> > down on my system (which is very strange).
> >
> > > - the first time a string is output, assign it a number (incrementing
> > > counter from the start of the document) and output other occurrences of
> > > the string as that number. Since XML is highly redundant, this would
> > > save much, much space. Sure, this increases write time, but will reduce
> > > read time. But this can lead to issues regarding memory consumption,
> > > since it requires keeping all previously read strings.
> >
> > I'm already doing this :)
> Sorry, I carefully studied XMLCompiler, but went fast over
> CompiledXMLxxxStream.

No problem.
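
To make the string-numbering idea concrete, here is a rough sketch of how such a table could work. This is not the actual CompiledXMLxxxStream code: the class name, the NEW_STRING marker and the use of a 4-byte int per reference are all assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class StringTableDemo {
    // Hypothetical marker meaning "a new string follows".
    static final int NEW_STRING = -1;

    // Writes each string in full the first time, then only its index.
    static byte[] encode(List<String> strings) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            Map<String, Integer> seen = new HashMap<>();
            for (String s : strings) {
                Integer index = seen.get(s);
                if (index == null) {
                    seen.put(s, seen.size()); // next free number
                    out.writeInt(NEW_STRING);
                    out.writeUTF(s);
                } else {
                    out.writeInt(index);
                }
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Rebuilds the strings; the reader keeps every string seen so far,
    // which is the memory cost mentioned in the thread.
    static List<String> decode(byte[] data, int count) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            List<String> table = new ArrayList<>();
            List<String> result = new ArrayList<>();
            for (int i = 0; i < count; i++) {
                int code = in.readInt();
                if (code == NEW_STRING) {
                    String s = in.readUTF();
                    table.add(s);
                    result.add(s);
                } else {
                    result.add(table.get(code));
                }
            }
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A real format would probably use a more compact reference than a full int; the point here is only that writer and reader assign the same numbers because both count strings in order of first appearance.
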

> >
> > > Several XML compression tools are also listed on
> > > but I think we're
> > > talking here more about a binary format than compression.
> >
> > No, we are talking about the fastest parsable format you can compile SAX
> > events in. Speed is my main concern, not size (even if, given the same
> > speed, I optimized for size).
> >
> > > BTW, I see several great uses for compiled XML :
> > >
> > > - store pre-parsed XML files on disk (in the repository) to allow fast
> > > reload whenever they're thrown out of the cache. I'm not sure if it's
> > > very useful for XSPs since they're already compiled into class files,
> > > but it will surely be for XSLTs.
> >
> > Totally, this is the main reason to write such a thing.
> >
> > I forecast some "compile by xml WAR" tool inside Cocoon2 that will allow
> > you to precompile all the XSPs and all the static XML documents and
> > package the whole thing for production, indicating all compilation and
> > validation problems.
> >
> > If you think about it, XML and Java are very similar in this concern.
> >
> Just as javac stores line number and filename information in class
> files, what do you think about optionally storing Locator information in
> the compiled XML ? This would allow easier debugging when a transformer
> detects an error when processing an XMLC file (not a validity error, which
> would have been found during initial parsing, but a semantic or
> application level error in the data which occurs at run time).

Sounds like a great idea, didn't think about that...yeah, very nice

> > > - store element-generation code of XSPs into XML bytecode fragments
> > > using byte[] variables. This will make java files much smaller and help
> > > avoiding reaching the 64k size limit for methods bodies.
> >
> > Yep, this is where the idea came from (I had a post on cocoon-users
> > about this subject a while ago).
> >
> > > - Off topic WRT Cocoon, but IMO worth studying : reduce network load
> > > between XML-enabled applications that understand this format. Mmmh, this
> > > can be the first step of a "binary/xml" mime-type ! Once browsers accept
> > > it, we can choose to send XML or XMLC depending on the HTTP Accept
> > > header !!
> >
> > No, that would totally suck and I'll tell you why:
> >
> > 1) CXML is normally bigger than the original file. Not much, but it
> > rarely compresses (since only string redundancy is eliminated).
> >
> Mmmh... is it bigger because writing ints to reference element name,
> namespace URI, etc is bigger than the average size of the "prefix:name"
> string ?

No, it's bigger because of the string encoding which is always UTF
(didn't want to have tons of encoders around) so it might take as much
as 3 bytes for a single char, while it takes 1 byte for the most common
7-bit ASCII chars.... it's asymmetrical.

Not that I'm happy with this solution: my profiling indicates that 30%
of the time is spent in reading and converting the bytes... so, I can
hear you thinking, why don't you encode plain unicode and forget about

Well, it turns out that plain unicode (straight 16 bits per char)
always doubles the chars and while it is perfect for Chinese, it sucks big
time for all ASCII or Latin ISO charsets. But since I care for speed
first, I tried to profile that and it turns out that it's much slower!

It's hard to know why, but I have some ideas:

1) the Sun JVM is slow as hell in the native library (didn't try IBM's
which is supposed to be the fastest) and native code is never

2) buffering helps a great deal (I got 100-fold performance
improvement!!!) but the more bits to read from disk the slower, no
matter what.

3) tight loops are very well optimized by HotSpot... in fact, Xerces
improves by about 30% over time while my code improves by less than 5%

Possible solutions are:

1) try to apply an incremental compression algorithm to the char arrays
(Huffman requires dictionaries so it's not great for small fragments,
Burrows-Wheeler is too slow... Lempel-Ziv 77 might be the only good one
since 78 is patented)

2) use the remaining bits of the UTF encoding to store delta
information... for example

 this is the thing

might be encoded as
 this [3:3][8:2]e [12:3]ng

where [x:y] indicates "go back x chars and copy y chars here"

which is a weak alternative to the Lempel-Ziv approach but might be
faster to perform.
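
A rough sketch of a decoder for the [x:y] notation above, to show how cheap the expansion step could be. It works on the textual notation only; a real encoding would pack the references into spare bits rather than brackets, and the class name is made up. Counting x back from the end of the output produced so far, the reference that copies "is " comes out as [3:3].

```java
class BackReferenceDemo {
    // Expands every "[x:y]" token: go back x chars in the output
    // produced so far and copy y chars from there.
    // Limitation of this sketch: a literal '[' in the text is not supported.
    static String decode(String encoded) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < encoded.length()) {
            char c = encoded.charAt(i);
            if (c == '[') {
                int colon = encoded.indexOf(':', i);
                int close = encoded.indexOf(']', colon);
                int back = Integer.parseInt(encoded.substring(i + 1, colon));
                int length = Integer.parseInt(encoded.substring(colon + 1, close));
                int start = out.length() - back;
                for (int k = 0; k < length; k++) {
                    // copy char by char so overlapping references also work
                    out.append(out.charAt(start + k));
                }
                i = close + 1;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints "this is the thing"
        System.out.println(decode("this [3:3][8:2]e [12:3]ng"));
    }
}
```

The decoder is a single pass with no dictionary to build, which is what makes it attractive when read speed matters more than compression ratio.
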
> > 2) textual compressors such as gzip compress XML better than CXML.
> >
> > 3) XML compressors (such as XMill) perform much better than gzip even
> > for well-formed documents.
> >
> > CXML focuses on speed then size.
> >
> > XMill focuses on size then speed.
> >
> > Also, CXML is highly asymmetrical: it's much faster to interpret than to
> > compile.... while for normal XML publishing, you need a fast way to
> > "generate" SAX events but also a fast way to "consume" these events and
> > serialize them into a stream of chars.
> >
> > And my CXML format is highly biased toward generation of events rather
> > than consumption.
> >
> Ok, I understand your point of view. My initial idea was that a simple
> encoding scheme (easily interpretable and not CPU intensive), even if
> not the most efficient is more likely to be widely adopted. But of
> course, it has to have a minimal efficiency ;-)

Download XMill and play with it, it's very efficient and uses the same
libraries that gzip uses. They translate the XML file into a sequence of
XPaths (instead of SAX events) then they compress that by analyzing
specific redundancies... they create an encoding which is optimized for
the algorithms that will try to compress the information, while my
encoding is only done to simplify the job of the decoder thus increasing

Anyway, for an XML wire transfer on slow networks, XMill is probably the
best choice currently available, followed closely by GZIP which is
great and already available on HTTP/1.1 compliant browsers.

Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<>                             Friedrich Nietzsche
 Missed us in Orlando? Make it up with ApacheCON Europe in London!
------------------------- http://ApacheCon.Com ---------------------
