xml-general mailing list archives

From Stefano Mazzocchi <stef...@apache.org>
Subject Re: Performance, XML messages compression
Date Mon, 07 Feb 2000 13:42:06 GMT

Thomas Yip wrote:
> 
> My compression knowledge is limited to one course in university.

Since data compression algorithms are one of my favorite hobbies, I think
I can add something here.
 
> But what I understand is that compression based on LZW, like WinZip,
> gz, etc., will do well on files with lots of repetition, which is just
> what XML is.

You're right, but XML is not just plain text; it is XML, so a
compression algorithm tuned specifically for XML should improve
compression. In fact, XML's verbosity doesn't add any information to
the stream.

> And the compression used in modems, based on LZ77, does exceptionally
> well on text compression.

True, but they work by building an adaptive dictionary. If you think
about it, your schema tells you _a_lot_ about what the document can look
like. By using the schema as a good "seed" for your token dictionary,
you can, say, express a tag with a few bits instead of spelling out all
its letters.
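
The seeding idea can be tried today with the preset-dictionary feature of
zlib, swapped in here for a real XML-aware algorithm. This is only a rough
sketch: the element names and the seed string are made up for illustration,
and a real seed would be derived from the actual schema.

```python
import zlib

# Hypothetical seed: the tag vocabulary a schema for this message type allows.
SCHEMA_SEED = b"<order><customer></customer><item><sku></sku><qty></qty></item></order>"

def compress_seeded(message: bytes, seed: bytes = SCHEMA_SEED) -> bytes:
    """Deflate `message` with `seed` preloaded as the preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=seed)
    return comp.compress(message) + comp.flush()

def decompress_seeded(blob: bytes, seed: bytes = SCHEMA_SEED) -> bytes:
    """Inverse of compress_seeded; the receiver must hold the same seed."""
    decomp = zlib.decompressobj(zdict=seed)
    return decomp.decompress(blob) + decomp.flush()

# Made-up sample message using only tags that appear in the seed.
message = b"<order><customer>ACME</customer><item><sku>42</sku><qty>3</qty></item></order>"
plain = zlib.compress(message, 9)      # no seed: tags must be learned from the message
seeded = compress_seeded(message)      # seed: tags are back-references from byte one
print(len(message), len(plain), len(seeded))
```

Both sides must agree on the seed in advance, which is exactly what a shared
schema would give you. The gain matters most for short messages, where an
adaptive dictionary has no time to learn the tags.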
 
> So, in both cases, you can expect a good compression rate. So, I don't
> really think a new compression scheme is necessary.

To be honest, I really can't tell. Lempel and Ziv created the LZ
algorithms to "measure" the amount of information in a stream;
compressing it was not their goal. But of course, the information is
only there if you can recreate the original message from it, so the
algorithms achieve compression as well.
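
To make the "measuring" view concrete, here is a small sketch that counts the
phrases of an LZ78-style incremental parse. I am using that parse as a simple
stand-in for an information measure, not as the exact definition Lempel and
Ziv used; the function name and the sample inputs are made up.

```python
def lz78_phrase_count(data: bytes) -> int:
    """Count the phrases produced by an LZ78-style incremental parse."""
    seen = set()      # phrases already in the dictionary
    phrase = b""      # phrase currently being extended
    count = 0
    for byte in data:
        phrase += bytes([byte])
        if phrase not in seen:
            # Shortest phrase not seen before: record it and start a new one.
            seen.add(phrase)
            count += 1
            phrase = b""
    if phrase:
        count += 1    # trailing phrase that was already in the dictionary
    return count

repetitive = b"<a/>" * 64        # 256 bytes of pure repetition
varied = bytes(range(256))       # 256 bytes, every value different
print(lz78_phrase_count(repetitive), lz78_phrase_count(varied))
```

Fewer phrases for the same length means less information per byte, and each
phrase can be sent as a (dictionary index, next byte) pair, which is where the
compression comes from.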

The best thing would be to invent a new algorithm based on XML-specific
properties, then put it into practice and compare it with other lossless
text-based algorithms... Hmmmm, if I only had the time...
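
For the comparison step, a baseline harness along these lines might be a
starting point. It is a sketch only: the sample document is synthetic with
made-up element names, and the three codecs are simply the general-purpose
ones shipped with Python, not candidates anyone has proposed here. A new
XML-specific algorithm would be added to the table as another pair of
functions.

```python
import bz2
import lzma
import zlib

def make_sample(records: int = 200) -> bytes:
    """Build a made-up, repetitive XML document (hypothetical element names)."""
    rows = "".join(
        f"<item><sku>{i}</sku><qty>{i % 7}</qty></item>" for i in range(records)
    )
    return f"<order>{rows}</order>".encode("utf-8")

# name -> (compress, decompress); an XML-aware codec would be one more entry.
CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}

def compare(data: bytes) -> dict:
    """Return {codec name: compressed size / original size}."""
    ratios = {}
    for name, (pack, unpack) in CODECS.items():
        blob = pack(data)
        assert unpack(blob) == data   # lossless: the round trip must be exact
        ratios[name] = len(blob) / len(data)
    return ratios

for name, ratio in compare(make_sample()).items():
    print(f"{name}: {ratio:.1%} of original size")
```

Real B2B messages should replace the synthetic sample before drawing any
conclusion, since the numbers depend heavily on message size and content.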
 
> I think it is realistic to expect that a compressed XML file will be
> 20% to 35% of the original size. Is that enough in your case?
> For comparison, an average text file is about 50%.

One thing is for sure: XML is normally much more compressible than plain
text, exactly like PostScript, because the "system information" is very
verbose and adds little to the real information content, that is, the
entropy, of the stream.
 
> >      Does anyone know of any working groups that are looking at a
> >      standard for XML compression?

AFAIK, no.

> >      I'm specifically looking at the B2B scenario, where a high volume of
> >      messages will be flowing between companies. This is obviously a
> >      performance issue.

Yep. I'd welcome such an effort at xml.apache.org.

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<stefano@apache.org>                             Friedrich Nietzsche
--------------------------------------------------------------------
 Come to the first official Apache Software Foundation Conference!  
------------------------- http://ApacheCon.Com ---------------------


