corinthia-dev mailing list archives

From jan i
Subject Re: DFStorage
Date Thu, 01 Jan 2015 11:17:32 GMT
On Thursday, January 1, 2015, Peter Kelly wrote:

> I realise that I haven’t done a very good job of documenting the code in
> Corinthia, as you’ve probably noticed :) I’ve been meaning to get around to
> this for a while now.
> There’s an “abstract class” (or, more accurately, interface) called
> DFStorage which abstracts over different ways of storing “files” (byte
> arrays/byte streams) - the concrete implementations are 1) memory 2) a
> directory in the filesystem and 3) a zip file.
> The idea with the zip implementation of DFStorage is that you create such
> an object, and when you read from or write to it, it works directly from
> the zip file. For example:
> DFStorage *st = DFStorageOpenZip("filename.docx", &error);
> if (st != NULL) {
>     DFBuffer *foo = DFStorageRead(st, …);
>     DFStorageWrite(st, …);
>     // etc.
>     DFStorageSave(st);
>     DFStorageRelease(st);
> }
> This way, filter code doesn’t need to care how the data is actually stored.
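
A minimal sketch of how such an "interface" can be expressed in C, using a table of function pointers with an in-memory backend behind it. All names here (Storage, StorageOps, StorageNewMemory and so on) are hypothetical and chosen for illustration; this is not the actual DFStorage code, and the directory and zip backends are left out.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical names throughout; this is not Corinthia's DFStorage code. */
typedef struct Storage Storage;

/* The "interface": one function pointer per operation. */
typedef struct {
    int (*read)(Storage *st, const char *path, const void **data, size_t *len);
    int (*write)(Storage *st, const char *path, const void *data, size_t len);
    void (*release)(Storage *st);
} StorageOps;

struct Storage {
    const StorageOps *ops; /* selects the backend */
    void *impl;            /* backend-private state */
};

/* Filter code only calls these three functions; it never sees the backend. */
int StorageRead(Storage *st, const char *path, const void **data, size_t *len) {
    assert(st != NULL);
    return st->ops->read(st, path, data, len);
}

int StorageWrite(Storage *st, const char *path, const void *data, size_t len) {
    assert(st != NULL);
    return st->ops->write(st, path, data, len);
}

void StorageRelease(Storage *st) {
    if (st != NULL)
        st->ops->release(st);
}

/* In-memory backend: a singly linked list of named byte arrays. */
typedef struct MemEntry {
    char *path;
    void *data;
    size_t len;
    struct MemEntry *next;
} MemEntry;

static MemEntry *memFind(Storage *st, const char *path) {
    for (MemEntry *e = st->impl; e != NULL; e = e->next)
        if (strcmp(e->path, path) == 0)
            return e;
    return NULL;
}

static int memRead(Storage *st, const char *path, const void **data, size_t *len) {
    MemEntry *e = memFind(st, path);
    if (e == NULL)
        return 0; /* unknown path */
    *data = e->data;
    *len = e->len;
    return 1;
}

static int memWrite(Storage *st, const char *path, const void *data, size_t len) {
    void *copy = malloc(len > 0 ? len : 1);
    if (copy == NULL)
        return 0;
    if (len > 0)
        memcpy(copy, data, len);
    MemEntry *e = memFind(st, path);
    if (e != NULL) { /* replace the content of an existing entry */
        free(e->data);
        e->data = copy;
        e->len = len;
        return 1;
    }
    e = malloc(sizeof *e);
    char *name = malloc(strlen(path) + 1);
    if (e == NULL || name == NULL) {
        free(e);
        free(name);
        free(copy);
        return 0;
    }
    strcpy(name, path);
    e->path = name;
    e->data = copy;
    e->len = len;
    e->next = st->impl; /* push onto the front of the list */
    st->impl = e;
    return 1;
}

static void memRelease(Storage *st) {
    MemEntry *e = st->impl;
    while (e != NULL) {
        MemEntry *next = e->next;
        free(e->path);
        free(e->data);
        free(e);
        e = next;
    }
    free(st);
}

static const StorageOps memOps = { memRead, memWrite, memRelease };

Storage *StorageNewMemory(void) {
    Storage *st = malloc(sizeof *st);
    if (st == NULL)
        return NULL;
    st->ops = &memOps;
    st->impl = NULL;
    return st;
}
```

A directory or zip backend would supply its own StorageOps table and constructor, and callers of StorageRead() and StorageWrite() would not need to change.
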
> Now with the current implementation, which is a very simplistic one, it
> simply reads the whole zip file into memory. This is largely due to a
> limitation in the minizip API, which enforces sequential access to the
> entries in a file. It would be conceivable to have the zip DFStorage
> implementation first read a directory listing, and then for each file
> that’s requested, do a linear scan through all the entries before finding
> the requested file, and then reading that. This would be an O(n) operation,
> but would be unlikely to be a major problem since most zip packages we’re
> dealing with will only have a fairly small number of entries.
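
The linear scan described above can be modelled as follows. This is only a sketch over a plain array standing in for the directory listing: ArchiveEntry and findEntryLinear are hypothetical names, and no minizip calls are used.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of one directory-listing entry. */
typedef struct {
    const char *name; /* entry name inside the package */
    long offset;      /* where the entry's data starts */
} ArchiveEntry;

/*
 * Walk the entries in order until the name matches.
 * Returns the entry's offset, or -1 if it is not present.
 * If visited is non-NULL, it is incremented once per entry examined,
 * which makes the O(n) cost visible to the caller.
 */
long findEntryLinear(const ArchiveEntry *entries, size_t count,
                     const char *name, size_t *visited) {
    assert(entries != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (visited != NULL)
            (*visited)++;
        if (strcmp(entries[i].name, name) == 0)
            return entries[i].offset;
    }
    return -1;
}
```
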
> Minizip does not provide any way to cache the location in the zip file of
> a particular entry, even though this information would be possible to
> obtain in theory (just not through minizip’s API). If I were writing a zip
> implementation from scratch (and maybe this is something we could
> consider), I would have it read a list of all entries and remember their
> locations in a hash table, so that when a particular named entry is
> requested, we can go directly to that point in the file without having to
> do a linear scan.
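
A rough sketch of the hash-table idea: entry names map to file offsets, so a lookup does not need to walk every entry. The names EntryIndex, EntryIndexPut and so on are hypothetical, the table uses open addressing with an FNV-1a hash purely as one possible choice, and it assumes the name strings outlive the index. Reading the offsets out of the zip file itself is not shown.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical index from entry name to offset in the zip file. */
typedef struct {
    const char *name; /* NULL marks an empty slot; string is borrowed */
    long offset;
} IndexSlot;

typedef struct {
    IndexSlot *slots;
    size_t capacity; /* always a power of two */
} EntryIndex;

/* 32-bit FNV-1a hash of a NUL-terminated string. */
static uint32_t hashName(const char *s) {
    uint32_t h = 2166136261u;
    for (; *s != '\0'; s++) {
        h ^= (unsigned char)*s;
        h *= 16777619u;
    }
    return h;
}

/* Allocate a table with room for at least `expected` entries at <= 50% load. */
int EntryIndexInit(EntryIndex *idx, size_t expected) {
    size_t cap = 8;
    while (cap < expected * 2)
        cap *= 2;
    idx->slots = malloc(cap * sizeof *idx->slots);
    if (idx->slots == NULL) {
        idx->capacity = 0;
        return 0;
    }
    for (size_t i = 0; i < cap; i++)
        idx->slots[i].name = NULL;
    idx->capacity = cap;
    return 1;
}

/* Insert or update; returns 0 only if the table is completely full. */
int EntryIndexPut(EntryIndex *idx, const char *name, long offset) {
    assert(idx->slots != NULL);
    size_t mask = idx->capacity - 1;
    size_t i = hashName(name) & mask;
    for (size_t probes = 0; probes < idx->capacity; probes++) {
        IndexSlot *s = &idx->slots[i];
        if (s->name == NULL || strcmp(s->name, name) == 0) {
            s->name = name;
            s->offset = offset;
            return 1;
        }
        i = (i + 1) & mask; /* linear probing */
    }
    return 0;
}

/* Look up a name; on success stores the offset and returns 1. */
int EntryIndexGet(const EntryIndex *idx, const char *name, long *offset) {
    assert(idx->slots != NULL);
    size_t mask = idx->capacity - 1;
    size_t i = hashName(name) & mask;
    for (size_t probes = 0; probes < idx->capacity; probes++) {
        const IndexSlot *s = &idx->slots[i];
        if (s->name == NULL)
            return 0; /* hit an empty slot: the name is not indexed */
        if (strcmp(s->name, name) == 0) {
            *offset = s->offset;
            return 1;
        }
        i = (i + 1) & mask;
    }
    return 0;
}

void EntryIndexFree(EntryIndex *idx) {
    free(idx->slots);
    idx->slots = NULL;
    idx->capacity = 0;
}
```

With the index built once when the file is opened, each later request for a named entry becomes a lookup followed by a seek to the stored offset.
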
> Writing to zip files is another inconvenient thing, because it’s really
> only possible to do it in an append-only manner. If a large image is
> deleted from a document, or replaced with a modified image, we don’t want
> to keep the old one around; so instead we create an entirely new zip file
> and overwrite the old one. As with reading, the current implementation
> stores all the content in memory and then writes it out to disk in one go
> when you call DFStorageSave(). However, for documents containing large
> images this may mean unacceptable amounts of memory usage, depending on the
> application/environment.
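
The "write a new file and overwrite the old one" step can be sketched with standard C as below. This is an illustration only: saveByReplacing is a hypothetical helper, the ".tmp" suffix is an arbitrary choice, and the content is passed as a single in-memory buffer, which mirrors the current in-memory approach rather than a streaming one. Note that rename() replaces an existing destination on POSIX systems, while the C standard leaves that case implementation-defined.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: write `len` bytes to "<path>.tmp", then rename the
 * temporary file over `path`. Returns 1 on success, 0 on failure.
 * On failure the original file at `path` is left untouched.
 */
int saveByReplacing(const char *path, const void *data, size_t len) {
    assert(path != NULL);
    size_t n = strlen(path);
    char *tmp = malloc(n + 5); /* room for ".tmp" plus the terminator */
    if (tmp == NULL)
        return 0;
    memcpy(tmp, path, n);
    memcpy(tmp + n, ".tmp", 5);

    FILE *f = fopen(tmp, "wb");
    int ok = (f != NULL);
    if (ok && len > 0 && fwrite(data, 1, len, f) != len)
        ok = 0;
    if (f != NULL && fclose(f) != 0)
        ok = 0;
    /* Only replace the old file once the new one is completely written. */
    if (ok && rename(tmp, path) != 0)
        ok = 0;
    if (!ok)
        remove(tmp); /* best-effort cleanup of the partial file */
    free(tmp);
    return ok;
}
```
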
> Coming back to what I said in my previous response to Jan about having
> multiple zip files open at a time, it’s not done at the moment within the
> context of a single conversion (but can happen with multiple threads if
> there are several conversions going on at the same time). However, if we
> were to adopt the above approach to limit memory usage of the zip-based
> DFStorage objects, and we were converting say directly from one zip-based
> file format to another (think OOXML to ODF), this would require the ability
> to have multiple zip files open at the same time, and in the same thread.
> On the question of providing our own versions of the APIs for external
> libraries, I guess I can sort of see some benefit there now, in the sense
> that we can, in theory at least, swap in a different implementation. But
> this is only possible if the other implementation works the same
> semantically, with only a different syntax. For example, the use of wrapper
> functions in and of itself does not really help much IMHO, as the way in
> which a given piece of functionality is exposed may differ between
> libraries. I’m not opposed to wrapper functions as in Jan’s branch as such,
> but it’s just some food for thought.

I agree that it is food for thought. My idea with the API is that it is the
first step towards what you suggest: with the API in place, I can work in
platform to first remove minizip (e.g. use an external, non-source library)
and then add the scanning functionality we need.

I just want a clear separation between platform/library code and
application code. BUT we should only make platform APIs where it makes sense.

jan i

> —
> Dr Peter M. Kelly
> PGP key: <>
> (fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966)

Sent from My iPad, sorry for any misspellings.
