corinthia-dev mailing list archives

From "Dennis E. Hamilton" <dennis.hamil...@acm.org>
Subject RE: Zip madness !
Date Sat, 01 Aug 2015 15:40:34 GMT
It doesn't matter.  The structure of Zip archive files is what it is and it is being used on
the document formats that interest us.  We have no choice in the matter [;<).

There are profiles of Zip when it is employed as a carrier for standard document-file formats.
 It is important to know (1) the Zip specification that is the basis for a standard format
and (2) the profile that is used.  That applies to OOXML (in the OPC portion of the spec)
and ODF (in Part 3 of ODF 1.2, section 17 of ODF 1.1). It applies for ePub also.  There is
also now a common ISO profile of Zip that is intended to provide a progression of layers for
use in support of document-format specifications.  
 
 - Dennis

SOME BACKGROUND

The local file headers are often produced serially as the archive is built and are there for
serial processing of the Zip on structures that do not allow random access into the stream.
 (OPC has a level of abstraction that allows more-efficient streaming over networks and in
cloud applications but I don't know how much that is exploited outside of Microsoft products.
 You may find it interesting to know that Visual Studio employs OPC in a variety of ways in
carrying development artifacts.)
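
As a rough illustration of that serial use, the sketch below walks an archive with forward reads only, taking each entry's name and size from its local file header and never consulting the directory at the end. The function name walk_local_headers is made up for this example, and it assumes sizes are recorded in the local headers (no data descriptors, no Zip64), which holds for the small in-memory archive built here but not for every Zip.

```python
import io
import struct
import zipfile

# Fixed 30-byte portion of a local file header (all fields little-endian).
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")


def walk_local_headers(stream):
    """Yield (name, compressed_size) per entry using forward reads only."""
    while True:
        fixed = stream.read(LOCAL_HEADER.size)
        if len(fixed) < LOCAL_HEADER.size or fixed[:4] != b"PK\x03\x04":
            return  # reached the directory at the end, or the end of the stream
        (_sig, _ver, flags, _method, _time, _date,
         _crc, csize, _usize, nlen, xlen) = LOCAL_HEADER.unpack(fixed)
        if flags & 0x08:
            # Bit 3: sizes are stored after the data, so a serial reader
            # cannot skip ahead from the header alone. Not handled here.
            raise NotImplementedError("data descriptor entries not handled")
        name = stream.read(nlen).decode("utf-8")  # assumes ASCII/UTF-8 names
        stream.read(xlen)   # skip the extra field
        stream.read(csize)  # skip the file data itself
        yield name, csize


# Build a small archive in memory and walk it serially.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("mimetype", "application/vnd.oasis.opendocument.text")
    zf.writestr("content.xml", "<doc/>")
buf.seek(0)
entries = list(walk_local_headers(buf))
```

The flag check is the interesting part: when an archive is itself written to a non-seekable stream, the writer may not know sizes up front and defers them, which is one case where the local headers carry less than the directory does.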

The global directory, at the end, is a cross check and, for positionable streams, an additional
support for ensuring that the Zip has not been damaged.  In some cases, the global directory
has more information than local file headers, since such details might only be known after
the local file stream has been produced (checksums for example, even the length of a stream),
and the global forms can employ larger pointers and sizes than can be used in the local file
headers.  The global directory might also be usable in recovery of data from a damaged Zip
for which an intact global directory is still present.  For programs on modern file systems,
I suspect that the global directory is used almost exclusively, although the local file headers
are still there, and correct.  In fact, some programs "sniff" the first local file header
of ODF packages to detect the "mimetype" file entry, although it is not required that it be
the first local file header.
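
A minimal sketch of that kind of sniffing, with a fallback for packages where the entry is not first: it inspects the first local file header directly and, if that does not yield a stored "mimetype" entry, falls back to a lookup through the global directory via Python's zipfile module. The function name sniff_mimetype is invented for this example, and the fast path assumes the entry is uncompressed with its size recorded in the local header.

```python
import io
import struct
import zipfile

# Fixed 30-byte portion of a local file header (all fields little-endian).
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")


def sniff_mimetype(data):
    """Return the content of the 'mimetype' entry, or None if absent."""
    # Fast path: look only at the first local file header.
    if len(data) >= LOCAL_HEADER.size:
        (sig, _ver, flags, method, _time, _date,
         _crc, csize, _usize, nlen, xlen) = LOCAL_HEADER.unpack_from(data, 0)
        name_start = LOCAL_HEADER.size
        name = data[name_start:name_start + nlen]
        stored = method == 0 and not flags & 0x08
        if sig == b"PK\x03\x04" and name == b"mimetype" and stored:
            body = name_start + nlen + xlen
            return data[body:body + csize].decode("ascii")
    # Slow path: let zipfile locate the entry through the global directory.
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        if "mimetype" in zf.namelist():
            return zf.read("mimetype").decode("ascii")
    return None


# Example: the entry is found whether or not it comes first.
MIMETYPE = "application/vnd.oasis.opendocument.text"
package = io.BytesIO()
with zipfile.ZipFile(package, "w") as zf:
    zf.writestr("content.xml", "<doc/>")
    zf.writestr("mimetype", MIMETYPE)
found = sniff_mimetype(package.getvalue())
```

A program that only implements the fast path would miss the entry in the example package, which is the risk of relying on the first local file header alone.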

I find all of this intriguing, myself.  It is a challenge to provide a durable model that
delivers a useful API above the physical Zip structure that adapts to available capabilities
and removes concern for such details, allowing isolation under a better abstraction for use
on behalf of a document format.

-----Original Message-----
From: jan i [mailto:jani@apache.org] 
Sent: Saturday, August 1, 2015 02:33
To: dev@corinthia.incubator.apache.org
Subject: Zip madness !

Hi

Does anybody know why zip has a mad inefficient directory structure ?

I try to understand the why, but fail.

A zip file contains 1 global directory with information about every single
file (flat structure, no
sub directories, but filenames may contain a "/"). That is logical and
expected.

BUT in front of every file, there is a local file header, with the filename
and about 3/4 of the information
from the global directory. This information seems purely redundant and
unneeded.

What am I missing here ? on one of my test docx, the local headers are
about 10% of the filesize (looong filenames) which could be thrown away.

Hope somebody can see what I failed to see.
rgds
jan i.

