commons-dev mailing list archives

From Stefan Bodewig
Subject [compress] ZIP64: API imposed limits vs limits of the format
Date Thu, 04 Aug 2011 14:48:26 GMT
Hi all,

ZIP64 support in trunk is almost complete except for a case that is
pretty easy to implement but where I'll ask for API feedback once I've
managed to read up on how the InfoZIP people handle it.

There are a few places where our implementation doesn't allow for the
full range the ZIP format would support.  Some are easy to fix, some
are hard, and I'd like feedback on whether you consider it worth the
effort to fix them at all.

OK, here we go.

Total Size of the archive

There is no formal limit inside the format, in particular since ZIP
archives can be split into multiple pieces.  For each individual piece
the last "local file header" cannot have an offset of more than 2^64-1
bytes from the start of the file.

We don't support split archives at all so the size is limited to
one file.

ZipArchiveInputStream should work on arbitrary sizes.

ZipFile relies on RandomAccessFile so no archive can be bigger than
the maximum size supported by RandomAccessFile.  In particular the seek
method expects a long as argument so the hard limit would be an archive
size of 2^63-1 bytes.  In practice I expect RandomAccessFile to not
support files that big on many platforms.

This is a "hard" case IMHO; I don't see how we could implement ZipFile
without RandomAccessFile in any efficient way.
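
To make the seek limit concrete, here is a minimal sketch.  It assumes
the 8-byte offset field has been read into a Java long, so values of
2^63 or more show up as negative numbers.  SeekLimit and its methods
are hypothetical names, not part of Commons Compress, and
Long.toUnsignedString needs Java 8 or later.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

/** Hypothetical helper, not part of Commons Compress. */
final class SeekLimit {
    private SeekLimit() { }

    /**
     * A raw unsigned 64-bit offset stored in a long is usable with
     * RandomAccessFile.seek only if it is at most 2^63-1, i.e. if the
     * long is not negative.
     */
    static boolean isSeekable(long rawUnsignedOffset) {
        return rawUnsignedOffset >= 0;
    }

    /** Seeks to the offset or fails with a message showing the unsigned value. */
    static void seekTo(RandomAccessFile file, long rawUnsignedOffset)
            throws IOException {
        if (!isSeekable(rawUnsignedOffset)) {
            throw new IOException("offset "
                + Long.toUnsignedString(rawUnsignedOffset)
                + " is beyond 2^63-1, the largest position seek accepts");
        }
        file.seek(rawUnsignedOffset);
    }
}
```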

ZipArchiveOutputStream has two modes.  If it writes to a file, it will
use RandomAccessFile internally; otherwise it writes to a stream.  In
file mode the same limits apply that apply to ZipFile.

For the streaming mode offsets are currently stored as longs but that
could be changed to BigIntegers easily so we could reach 2^64-1 at the
expense of memory consumption and maybe even some performance issues
(the offsets are not really used in calculations so I don't expect any
major impact).
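
As a rough illustration of what BigInteger offsets would involve, the
sketch below turns an offset into the 8-byte little-endian field that
ZIP64 uses.  Zip64Offsets and toLittleEndian are hypothetical names
chosen for this example; this is not how trunk does it today.

```java
import java.math.BigInteger;

/** Hypothetical helper, not part of Commons Compress. */
final class Zip64Offsets {
    /** Largest value an 8-byte unsigned field can hold: 2^64-1. */
    static final BigInteger MAX_UNSIGNED_64 =
        BigInteger.ONE.shiftLeft(64).subtract(BigInteger.ONE);

    private Zip64Offsets() { }

    /** Encodes an offset as an 8-byte little-endian array. */
    static byte[] toLittleEndian(BigInteger offset) {
        if (offset.signum() < 0 || offset.compareTo(MAX_UNSIGNED_64) > 0) {
            throw new IllegalArgumentException("offset out of range: " + offset);
        }
        // toByteArray() is big-endian two's complement and may carry a
        // leading zero sign byte, so it can be up to 9 bytes long here.
        byte[] bigEndian = offset.toByteArray();
        byte[] result = new byte[8];
        for (int i = 0; i < bigEndian.length && i < 8; i++) {
            result[i] = bigEndian[bigEndian.length - 1 - i];
        }
        return result;
    }
}
```

The range check is done once per offset, which fits the observation
that the offsets are stored and written but hardly ever calculated
with.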

Size of an individual entry (compressed or not)

The format supports an unsigned 64 bit integer as size, ArchiveEntry's
get/setSize methods use long - so the API tops out at 2^63-1, roughly
half of what the format allows.

We could easily add an additional setter/getter for size that uses
BigInteger, the infrastructure to support it would be there.  OTOH it is
questionable whether we'd support anything > Long.MAX_VALUE in practice
because of the previous point anyway.
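
One possible shape for such an additional getter/setter pair is
sketched below.  EntrySize, setBigSize and getBigSize are hypothetical
names; ArchiveEntry has no such methods, and the sketch leaves out the
"size unknown" case that a real entry would have to handle.

```java
import java.math.BigInteger;

/** Hypothetical size holder, not part of Commons Compress. */
final class EntrySize {
    private static final BigInteger LONG_MAX =
        BigInteger.valueOf(Long.MAX_VALUE);
    private static final BigInteger MAX_UNSIGNED_64 =
        BigInteger.ONE.shiftLeft(64).subtract(BigInteger.ONE);

    private BigInteger size = BigInteger.ZERO;

    /** long-based setter in the style of the existing API. */
    void setSize(long size) {
        if (size < 0) {
            throw new IllegalArgumentException("negative size: " + size);
        }
        this.size = BigInteger.valueOf(size);
    }

    /** Additional setter covering the full unsigned 64-bit range. */
    void setBigSize(BigInteger size) {
        if (size.signum() < 0 || size.compareTo(MAX_UNSIGNED_64) > 0) {
            throw new IllegalArgumentException("size out of range: " + size);
        }
        this.size = size;
    }

    BigInteger getBigSize() {
        return size;
    }

    /** Fails instead of silently truncating sizes above Long.MAX_VALUE. */
    long getSize() {
        if (size.compareTo(LONG_MAX) > 0) {
            throw new ArithmeticException("size does not fit into a long: " + size);
        }
        return size.longValue();
    }
}
```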

Number of file entries in the archive

This used to be an unsigned 16 bit integer and has grown to an
unsigned 64 bit integer with ZIP64.

ZipArchiveInputStream should work with arbitrarily many entries.

ZipArchiveOutputStream uses a LinkedList to store all entries as it has
to keep track of the metadata in order to write the central directory.
It also uses an additional HashMap that could be removed easily by
storing the data together with the entries themselves.  LinkedList won't
allow more than Integer.MAX_VALUE entries which leaves us quite a bit
away from the theoretical limit of the format.

I'm confident that even I would manage to write an efficient singly
linked list that is only ever appended to and that is iterated over
exactly once from head to tail.  I'd even manage to keep track of the
size inside a long or BigInteger (if deemed necessary) in an O(1)
operation ;-)
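
A minimal sketch of such a list, assuming a long counter is enough.
AppendOnlyList is a hypothetical name and nothing like it exists in
trunk; it is only meant to show how little code is needed.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical append-only singly linked list with a long size. */
final class AppendOnlyList<E> implements Iterable<E> {
    private static final class Node<E> {
        final E value;
        Node<E> next;

        Node(E value) {
            this.value = value;
        }
    }

    private Node<E> head;
    private Node<E> tail;
    private long size;

    /** Appends at the tail in O(1) and bumps the counter. */
    void add(E value) {
        Node<E> node = new Node<>(value);
        if (tail == null) {
            head = node;
        } else {
            tail.next = node;
        }
        tail = node;
        size++;
    }

    /** O(1), and not capped at Integer.MAX_VALUE. */
    long size() {
        return size;
    }

    /** Iterates from head to tail. */
    @Override
    public Iterator<E> iterator() {
        return new Iterator<E>() {
            private Node<E> cursor = head;

            @Override
            public boolean hasNext() {
                return cursor != null;
            }

            @Override
            public E next() {
                if (cursor == null) {
                    throw new NoSuchElementException();
                }
                E value = cursor.value;
                cursor = cursor.next;
                return value;
            }
        };
    }
}
```

Memory is still one node per entry, so this only lifts the counting
limit, not the heap needed to hold the metadata.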

So ZipArchiveOutputStream could easily be fixed if we wanted to.
Whether it is worth the effort is a different question when the size of
the file is still limited to a single "disk" archive.

ZipFile is a totally different beast.  It contains several maps
internally and I don't really see how to implement things like

           ZipArchiveEntry getEntry(String name)

efficiently without a map.  I don't see myself writing an efficient map
with a capacity of Long.MAX_VALUE or bigger, either.

And even if we had one, there'd still be the "archive size" limit.

We could stick with documenting the limits of ZipFile properly.  In
practice I doubt many people will have to deal with archives of 2^63
bytes or more.  And even archives with 2^32 entries or more should be
rare - in which case people could fall back to ZipArchiveInputStream.

