From: Stefano Mazzocchi
Organization: Apache Software Foundation
Date: Tue, 17 Oct 2000 17:21:05 +0200
To: cocoon-dev@xml.apache.org
Subject: Re: XML Compilation
Message-ID: <39EC6E61.A2A2C9C6@apache.org>
References: <39EC2B4A.954408AE@apache.org> <39EC49E6.80BB586F@free.fr>

Sylvain Wallez wrote:
>
> Great, great !
>
> A few suggestions to make the format more compact without going into
> complicated compression algorithms :
>
> - using a byte instead of an int would divide SAX instructions code
> size by 4.

I'm already using a byte :) OutputStream.write(int c) already discards
the upper 24 bits (read the javadoc to find out).

> - repetition of elements name and namespace can be avoided for
> "endElement" : XMLInterpreter can hold a stack of open elements to
> retrieve these values.

Hmmm, good point, I'll try that and see if it's worth it. During
development I found out that not all optimizations end up being such:
for example, increasing the buffer size from 8 KB to 16 KB slows things
down on my system (which is very strange).

> - the first time a string is output, assign it a number (incrementing
> counter from the start of the document) and output other occurrences of
> the string as that number. Since XML is highly redundant, this would
> save much, much space.
> Sure, this increases write time, but will reduce
> read time. But this can lead to issues regarding memory consumption,
> since it requires keeping all previously read strings.

I'm already doing this :)

> Several XML compression tools are also listed on
> http://www.oasis-open.org/cover/xmlAndCompression.html but I think we're
> talking here more about a binary format than compression.

No, we are talking about the fastest parsable format you can compile SAX
events into. Speed is my main concern, not size (even if, given the same
speed, I optimized for size).

> BTW, I see several great uses for compiled XML :
>
> - store pre-parsed XML files on disk (in the repository) to allow fast
> reload whenever they're thrown out of the cache. I'm not sure if it's
> very useful for XSPs since they're already compiled into class files,
> but it will surely be for XSLTs.

Totally, this is the main reason to write such a thing.

I forecast some "compile by xml WAR" tool inside Cocoon2 that will allow
you to precompile all the XSPs and all the static XML documents and
package the whole thing for production, indicating all compilation and
validation problems.

If you think about it, XML and Java are very similar in this respect.

> - store element-generation code of XSPs into XML bytecode fragments
> using byte[] variables. This will make java files much smaller and help
> avoid reaching the 64k size limit for method bodies.

Yep, this is where the idea came from (I had a post on cocoon-users
about this subject a while ago).

> - Off topic WRT Cocoon, but IMO worth studying : reduce network load
> between XML-enabled applications that understand this format. Mmmh, this
> can be the first step of a "binary/xml" mime-type ! Once browsers accept
> it, we can choose to send XML or XMLC depending on the http-accept
> header !!

No, that would totally suck and I'll tell you why:

1) CXML is normally bigger than the original file.
Not much, but it rarely compresses (since only string redundancy is
eliminated).

2) textual compressors such as gzip compress XML better than CXML.

3) XML compressors (such as XMill) perform much better than gzip even
for well-formed documents.

CXML focuses on speed, then size. XMill focuses on size, then speed.

Also, CXML is highly asymmetrical: it's much faster to interpret than to
compile, while for normal XML publishing you need a fast way to
"generate" SAX events but also a fast way to "consume" these events and
serialize them into a stream of chars. And my CXML format is highly
biased toward generation of events rather than consumption.

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                       able to give birth to a dancing star.
                                            Friedrich Nietzsche
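
The three ideas discussed in this thread (one-byte instruction codes, a
string table that replaces repeated strings with a number, and an
open-element stack so that "endElement" carries no name) can be sketched
in a few lines of Java. This is only an illustration under assumptions:
the class name CompiledXmlSketch, the opcode values, the NEW_STRING and
STRING_REF markers and the byte layout are all invented here and are not
the actual CXML format; namespaces and attributes are left out, and the
syntax is present-day Java rather than what was available in 2000.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: opcodes and layout are assumptions, not real CXML.
final class CompiledXmlSketch {
    static final int START_ELEMENT = 1;
    static final int END_ELEMENT = 2;
    static final int CHARACTERS = 3;
    static final int NEW_STRING = 0;  // a string literal follows
    static final int STRING_REF = 1;  // a 4-byte string-table index follows

    /** Writes events as one-byte opcodes; repeated strings become indices. */
    static final class Compiler {
        private final ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        private final DataOutputStream out = new DataOutputStream(bytes);
        private final Map<String, Integer> seen = new HashMap<>();

        Compiler startElement(String name) { op(START_ELEMENT); string(name); return this; }
        Compiler characters(String text) { op(CHARACTERS); string(text); return this; }
        Compiler endElement() { op(END_ELEMENT); return this; } // no name written
        byte[] toByteArray() { return bytes.toByteArray(); }

        private void op(int code) {
            // write(int) emits only the low 8 bits of its argument
            try { out.write(code); } catch (IOException e) { throw new UncheckedIOException(e); }
        }

        private void string(String s) {
            try {
                Integer index = seen.get(s);
                if (index == null) {            // first occurrence: number it
                    seen.put(s, seen.size());
                    out.write(NEW_STRING);
                    out.writeUTF(s);
                } else {                        // later occurrence: index only
                    out.write(STRING_REF);
                    out.writeInt(index);
                }
            } catch (IOException e) { throw new UncheckedIOException(e); }
        }
    }

    /** Replays the bytes as readable event strings (stand-in for SAX calls). */
    static List<String> interpret(byte[] data) {
        List<String> events = new ArrayList<>();
        List<String> strings = new ArrayList<>();  // string table, rebuilt on read
        Deque<String> open = new ArrayDeque<>();   // names of open elements
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int op;
            while ((op = in.read()) != -1) {
                switch (op) {
                    case START_ELEMENT:
                        String name = readString(in, strings);
                        open.push(name);
                        events.add("start:" + name);
                        break;
                    case END_ELEMENT:
                        events.add("end:" + open.pop()); // name comes from the stack
                        break;
                    case CHARACTERS:
                        events.add("chars:" + readString(in, strings));
                        break;
                    default:
                        throw new IOException("unknown opcode " + op);
                }
            }
        } catch (IOException e) { throw new UncheckedIOException(e); }
        return events;
    }

    private static String readString(DataInputStream in, List<String> strings) throws IOException {
        int marker = in.read();
        if (marker == NEW_STRING) { String s = in.readUTF(); strings.add(s); return s; }
        if (marker == STRING_REF) { return strings.get(in.readInt()); }
        throw new IOException("unknown string marker " + marker);
    }

    /** Compiles <page><para>hi</para><para>hi</para></page>. */
    static byte[] sample() {
        return new Compiler().startElement("page")
            .startElement("para").characters("hi").endElement()
            .startElement("para").characters("hi").endElement()
            .endElement().toByteArray();
    }
}
```

Calling CompiledXmlSketch.interpret(CompiledXmlSketch.sample()) replays
the eight events in document order. Note that with a 4-byte index a
short string such as "hi" takes as many bytes by reference as it does
literally, which fits the point above that string numbering alone rarely
makes the output smaller than the source.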