corinthia-dev mailing list archives

From "Peter Kelly (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (COR-18) Replacing MiniZip
Date Mon, 19 Jan 2015 01:25:34 GMT

    [ https://issues.apache.org/jira/browse/COR-18?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14282064#comment-14282064
] 

Peter Kelly commented on COR-18:
--------------------------------

Regarding handling of paths and directories: The necessary logic for this is already implemented
for OPC (Open Packaging Conventions, used by OOXML) in filters/ooxml/common/OPC.c. The header
defines several classes, including OPCPackage, OPCPart, and OPCRelationship, which provide
access to and manipulation of the package at the OPC level of abstraction rather than that of a directory
hierarchy (within a zip or elsewhere).

Relative paths are handled appropriately. At load time, the function which does this, OPCPackageReadRelationships,
calls through to DFPathResolveAbsolute to ensure the correct path is obtained. At save time,
paths are all written as absolute.
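
To illustrate the kind of resolution involved, here is a minimal sketch in C. It is not the
DocFormats code: resolve_absolute is a hypothetical helper, and I am assuming that a relative
target is resolved against the directory of the source part name, with "." and ".." segments
collapsed. The real DFPathResolveAbsolute may differ in signature and behaviour.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not the DocFormats implementation): resolve `rel`
   against the directory of the absolute part name `base`, collapsing "."
   and ".." segments. Returns a malloc'd string the caller must free,
   or NULL if allocation fails. */
char *resolve_absolute(const char *base, const char *rel)
{
    size_t baseLen = strlen(base);
    size_t relLen = strlen(rel);
    char *joined = malloc(baseLen + relLen + 2);
    char *out = malloc(baseLen + relLen + 3);
    if (joined == NULL || out == NULL) {
        free(joined);
        free(out);
        return NULL;
    }

    if (rel[0] == '/') {
        /* Already absolute: only normalisation is needed */
        strcpy(joined, rel);
    } else {
        /* Keep the base up to and including its last '/' */
        const char *slash = strrchr(base, '/');
        size_t dirLen = (slash != NULL) ? (size_t)(slash - base) + 1 : 0;
        memcpy(joined, base, dirLen);
        strcpy(joined + dirLen, rel);
    }

    /* Rebuild the path one segment at a time */
    size_t outLen = 0;
    out[0] = '\0';
    for (char *seg = strtok(joined, "/"); seg != NULL; seg = strtok(NULL, "/")) {
        if (strcmp(seg, ".") == 0)
            continue;
        if (strcmp(seg, "..") == 0) {
            /* Drop the most recently added segment, if there is one */
            while (outLen > 0 && out[--outLen] != '/')
                ;
            out[outLen] = '\0';
            continue;
        }
        out[outLen++] = '/';
        strcpy(out + outLen, seg);
        outLen += strlen(seg);
    }
    if (outLen == 0)
        strcpy(out, "/");

    free(joined);
    return out;
}
```

With this sketch, a target of "media/image1.png" found in the relationships of
/word/document.xml resolves to /word/media/image1.png, and a target that is already
absolute is returned unchanged apart from normalisation.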

I've just put up a description of the DFStorage API (https://cwiki.apache.org/confluence/display/Corinthia/DFStorage)
which is sort of like a virtual filesystem layer, in that it allows us to store a set of files
in various places (currently a directory on the filesystem, a zip file, or in memory only).
This deliberately works on a "flat" model where there's no inherent concept of directories
- the aim being to keep the API simple. However, one can include path separators in filenames,
so it's still possible to store files at different points in a hierarchy.
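
A rough sketch of what a "flat" model means, using an in-memory store. The types and functions
here (FlatStore, flat_put, flat_get) are hypothetical and are not the DFStorage API; see the
wiki page for the real one. The point is only that a name is an opaque string: a '/' inside
it has no special meaning to the store, and a "directory" is never an entry in its own right.

```c
#include <stdlib.h>
#include <string.h>

#define FLAT_MAX_ENTRIES 64

/* Hypothetical types for illustration only; not the DFStorage API. */
typedef struct {
    char *name;   /* full name; may contain '/' but is treated as opaque */
    char *data;   /* NUL-terminated copy of the content */
} FlatEntry;

typedef struct {
    FlatEntry entries[FLAT_MAX_ENTRIES];
    size_t count;
} FlatStore;

static char *copy_string(const char *s)
{
    char *copy = malloc(strlen(s) + 1);
    if (copy != NULL)
        strcpy(copy, s);
    return copy;
}

/* Look up an entry by exact match on the whole name.
   Returns NULL if there is no such entry. */
const char *flat_get(const FlatStore *store, const char *name)
{
    for (size_t i = 0; i < store->count; i++) {
        if (strcmp(store->entries[i].name, name) == 0)
            return store->entries[i].data;
    }
    return NULL;
}

/* Add an entry, or overwrite an existing one with the same name.
   Returns 1 on success and 0 on failure (out of memory or store full). */
int flat_put(FlatStore *store, const char *name, const char *data)
{
    char *dataCopy = copy_string(data);
    if (dataCopy == NULL)
        return 0;

    for (size_t i = 0; i < store->count; i++) {
        if (strcmp(store->entries[i].name, name) == 0) {
            free(store->entries[i].data);
            store->entries[i].data = dataCopy;
            return 1;
        }
    }

    char *nameCopy = copy_string(name);
    if (nameCopy == NULL || store->count == FLAT_MAX_ENTRIES) {
        free(nameCopy);
        free(dataCopy);
        return 0;
    }
    store->entries[store->count].name = nameCopy;
    store->entries[store->count].data = dataCopy;
    store->count++;
    return 1;
}
```

Storing "word/media/image1.jpg" in a zero-initialised FlatStore needs no prior step for
"word" or "word/media", and looking up "word/media" afterwards finds nothing.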

The only limitation of this API is that it doesn't support "whole of directory" operations
like moving, copying, creating, deleting etc. This was a deliberate decision, made to cut
down on the complexity of code - there were quite a lot of places where code would e.g. check
for the existence of an images directory and create it if necessary before adding a .jpg file
to a docx package. For our particular requirements, this was unnecessary complexity - so now
you can simply give it a path name for a file and it will take care of the rest.
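
For a directory-backed store, "taking care of the rest" could look something like the sketch
below. This is an assumption about the general technique, not a copy of what DFStorage does:
write_file_at and ensure_parent_dirs are hypothetical names, and the code relies on POSIX
mkdir, so it would need adapting for Windows.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical helper: create every missing parent directory of `path`.
   POSIX only. Returns 1 on success, 0 on failure. */
static int ensure_parent_dirs(const char *path)
{
    if (path[0] == '\0')
        return 0;

    char *buf = malloc(strlen(path) + 1);
    if (buf == NULL)
        return 0;
    strcpy(buf, path);

    /* Cut the path at each '/' in turn and create that prefix */
    for (char *p = buf + 1; *p != '\0'; p++) {
        if (*p != '/')
            continue;
        *p = '\0';
        if (mkdir(buf, 0777) != 0 && errno != EEXIST) {
            free(buf);
            return 0;
        }
        *p = '/';
    }

    free(buf);
    return 1;
}

/* Hypothetical helper: write `len` bytes to `path`, creating parent
   directories first so the caller never has to check for them.
   Returns 1 on success, 0 on failure. */
int write_file_at(const char *path, const char *data, size_t len)
{
    if (!ensure_parent_dirs(path))
        return 0;

    FILE *f = fopen(path, "wb");
    if (f == NULL)
        return 0;

    size_t written = fwrite(data, 1, len, f);
    int closed = (fclose(f) == 0);
    return closed && written == len;
}
```

The caller passes a name such as "word/media/image1.jpg" and never tests for, or creates,
the media directory itself.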

> Replacing MiniZip
> -----------------
>
>                 Key: COR-18
>                 URL: https://issues.apache.org/jira/browse/COR-18
>             Project: Corinthia
>          Issue Type: Bug
>          Components: DocFormats - platform
>         Environment: source
>            Reporter: jan iversen
>            Assignee: jan iversen
>            Priority: Blocker
>             Fix For: 0.5
>
>
> MiniZip is a bit thin and, because of some changes needed, it might be better to replace
it in the DocFormats/3rdparty/external/ folder, as @peterkelly observes at COR-26 (comment)
> EASY STEPS
> For now, it might be desirable to simply replace the current code with MiniZip 1.1 from
http://www.winimage.com/zLibDll/minizip.html
> Since it is a simple dependency, this should work fine so long as there are no breaking
API changes in between 1.0h and 1.1.
> EVENTUALLY?
> It would be good to have something behind a stable API that permits random access for
reading file streams as Peter suggests. Ideally, that API would be aligned around the Document
Container File (DCF) profile of the official PKWare specification that is used commonly among
ePub, ODF, and the Open Packaging Conventions (OPC) used in OOXML and elsewhere. I don't know
what the latest status of that profile is at ISO/IEC JTC1 SC34, but it will become a common
international specification for this specialized usage of Zip as a compound document-format
container file.
> There are other places to look for ideas and possible sources of reusable code and API
considerations, including in Apache OpenOffice, the Apache ODF Toolkit (using Java), and
the Microsoft open-sourcing of its OOXML-access layer (in .NET I think). And the Microsoft
platform has some native support that it might be useful to be able to rely on in Windows-targeted
builds.
> There is also a CodePlex LibOPC project that is C code under a BSD-form license at https://libopc.codeplex.com/
One interesting feature of LibOPC that may interest Apache OpenOffice folk (i.e., @janiversen)
is a python script for generating Visual Studio projects that can be used for manipulating
and building on Windows.
> One caveat. For ingesting Zip-based document files, there needs to be a fair amount of
code to ensure resiliency and defense against DOS-ing of applications with malformed document
files. That may have to be grown, with attention to the code footprint on limited-capacity
devices (where presumably some of the heavy-lifting is off-loaded to the cloud). An
interesting feature of the OPC specification is that it is also designed to support remoting
of the document streams in a way where there is no requirement that a Zip file be transferred
to the client. That may be very much eventually, but it is useful to think about having an
API that would allow for that underneath.  [Ed.Note: COR-31 is related to this.]
> LEST WE FORGET?
> Although this is all .NET-fu, there may be useful ideas on this project,
> https://github.com/OfficeDev/Open-Xml-Sdk
> as a source of ideas (and some of the system-level dependencies may have Native Windows
counterparts as well). This might be useful for mining for other ideas higher up in the API
modeling too.
> ---
> I didn't think to mention POI and whatever they use as a model close to the Zip packages.
> I didn't realize until looking at the proposal to become an Apache incubator project
that the sources for minizip and tidy-html5 are not pristine. It would be good to reconstruct
the modification process and leave more footprints if the changes are not in the repository
here. (Actually, it would be good to reconstruct the modification anyhow, but diffs from git
would be helpful.)
> I'm thinking that there is no hurry to replace these in early stages. If a better API
is desired, the first step of getting that in place would be to build a shim that goes from
that API to anything at hand at first, such as minizip or some other library, and worry about
fit and performance later.
> jan: 
> POI is in java, so they have other packages available.
> I am currently working on expanding the platform part to also include zip and html, so
that we can change the libraries at a later stage. I think your idea of using libOPC is valid
and interesting...you, peter and svante know better whether it fits the project.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
