corinthia-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Peter Kelly <>
Subject DFStorage
Date Thu, 01 Jan 2015 10:58:16 GMT
I realise that I haven’t done a very good job of documenting the code in Corinthia, as you’ve
probably noticed :) I’ve been meaning to get around to this for a while now.

There’s an “abstract class” (or, more accurately, interface) called DFStorage which
abstracts over different ways of storing “files” (byte arrays/byte streams) - the concrete
implementations are 1) memory 2) a directory in the filesystem and 3) a zip file.

The idea with the zip implementation of DFStorage is that you create such an object, and when
you read from or write to it, it works directly from the zip file. For example:

DFStorage *st = DFStorageOpenZip(“filename.docx”,&error);
if (st != NULL) {
    DFBuffer *foo = DFStorageRead(st,…)
    // etc.

This way, filter code doesn’t need to care how the data is actually stored.

Now with the current implementation, which is a very simplistic one, it simply reads the whole
zip file into memory. This is largely due to a limitation in the minizip API, which enforces
sequential access to the entries in a file. It would be conceivable to have the zip DFStorage
implementation first read a directory listing, and then for each file that’s requested,
do a linear scan through all the entries before finding the requested file, and then reading
that. This would be an O(n) operation, but would be unlikely to be a major problem since most
zip packages we’re dealing with will only have a fairly small number of entries.

Minizip does not provide any way to cache the location in the zip file of a particular entry,
even though this information would be possible to obtain in theory (just not through minizip’s
AP). If I were writing a zip implementation from scratch (and maybe this is something we could
consider), I would have it read a list of all entries and remember their locations in a hash
table, so that when a particular named entry is requested, we can go directly to that point
in the file without having to do a linear scan.

Writing to zip files is another inconvenient thing, because it’s really only possible to
do it in an append-only manner. If a large image is deleted from a document, or replaced with
a modified image, we don’t want to keep the old one around; so instead we create an entirely
new zip file and overwrite the old one. As for reading, the current implementation stores
all the content in memory and then writes it out to disk in one go when you call DFStorageSave().
However for documents containing large images this may mean unacceptable amounts of memory
usage, depending on the application/environment.

Coming back to what I said in my previous response to Jan about having multiple zip files
open at a time, it’s not done at the moment within the context of a single conversion (but
can happen with multiple threads if there are several conversions going on at the same time).
However, if we were to adopt the above approach to limit memory usage of the zip-based DFStorage
objects, and we were converting say directly from one zip-based file format to another (think
OOXML to ODF), this would require the ability to have multiple zip files open at the same
time, and in the same thread.

On the question of providing our own versions of the APIs for external libraries, I guess
I can sort off see some benefit now there in the sense that we can, in theory at least, swap
out a different implementation. But this is only possible if the other implementation works
the same semantically, with only a different syntax. For example the use wrapper functions
in an of itself does not really help much IMHO, as the way in which a piece of given functionality
is exposed may differ between libraries. I’m not opposed to wrapper functions as in Jan’s
branch as such, but it’s just some food for thought.

Dr Peter M. Kelly

PGP key: <>
(fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966)

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message