From "Dennis E. Hamilton" <>
Subject RE: OOXML
Date Mon, 04 Aug 2014 16:01:20 GMT
It is important to understand that an XML DOM does not capture all of the constraints and referential
requirements within an ODF document.  In particular, content.xml does not contain everything:
there are references using XLink (relative hrefs) and also special identifiers (not IDREFs)
to other files, whether for binary attachments or into other defined parts (styles.xml and
meta.xml, to name two).
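
To make the point concrete, here is a minimal sketch in Python that pulls two kinds of out-of-tree references from a content.xml-like fragment: package-relative xlink:href values and text:style-name strings.  The function names and the sample document (including the image path and style name) are made up for illustration; only the namespace URIs are the real ODF and XLink ones.  Nothing in the DOM itself says that these strings point at other parts of the package.

```python
import xml.etree.ElementTree as ET

# Real namespace URIs; everything else in this sketch is illustrative.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"
STYLE_NAME = "{urn:oasis:names:tc:opendocument:xmlns:text:1.0}style-name"


def package_hrefs(content_xml):
    """Return xlink:href values that look package-relative."""
    root = ET.fromstring(content_xml)
    hrefs = []
    for elem in root.iter():
        href = elem.get(XLINK_HREF)
        # Skip absolute URLs and same-document fragments.
        if href and "://" not in href and not href.startswith("#"):
            hrefs.append(href)
    return hrefs


def style_name_refs(content_xml):
    """Return the distinct text:style-name strings (plain strings, not IDREFs)."""
    root = ET.fromstring(content_xml)
    names = {elem.get(STYLE_NAME) for elem in root.iter() if elem.get(STYLE_NAME)}
    return sorted(names)


# Hypothetical sample, trimmed to the bare minimum.
SAMPLE = """<office:document-content
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:draw="urn:oasis:names:tc:opendocument:xmlns:drawing:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"
    xmlns:xlink="http://www.w3.org/1999/xlink">
  <office:body><office:text>
    <text:p text:style-name="Heading_20_1">
      <draw:frame><draw:image xlink:href="Pictures/logo.png"/></draw:frame>
      <text:a xlink:href="http://example.org/">link</text:a>
    </text:p>
  </office:text></office:body>
</office:document-content>"""

print(package_hrefs(SAMPLE))
print(style_name_refs(SAMPLE))
```

Resolving either list still requires knowledge from the ODF specification (which package entry an href names, which part defines a style), which is exactly what the DOM and the schema do not carry.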

There is also considerable internal structuring that is off-hierarchy.  Some of the connections
are via fragment IDs (xml:id) and IDREFs, others are by identifiers (not IDs and IDREFs) that
are introduced in the ODF specification but which are not modelled in the Relax NG Schema
(beyond saying they have string values, for example).

This sort of thing also happens rather heavily in OOXML, where communication among parts uses
a unique cross-part relationship model.  There are also many cross references to named components
by other than XML IDs and IDREFs, whether or not the components and the references occur in
the same part of the OPC package.

One could continue the kind of hack that plants that information as benign markers into an
internal form of the XML parts (even as a single XML document, although that is tricky when
ODF documents are nested as subdocuments of another), so long as they are replaced when the
XML document is committed to a saved ODF document file format.

In terms of having a DOM that maps to the external file form and a different internal model,
the only time that the internal model needs to update the externally-oriented DOM is as part
of a Save operation.  There might be more coupling, but performance and storage issues will
doubtless impact the engineering outcome, especially for handling large documents with alacrity.
Copy and paste and undo management will also be factors, along with maintaining pagination,
word counts, and such.

On the other hand, it is convenient (practically necessary) to specify the semantics of ODF,
or some profile of ODF, as if operations are on the format itself, since it is only the format
that is more-or-less well-specified.  It would be interesting to know how much this could
be taken literally in an application.  I think there might be forensic tools on ODF documents
that might be able to operate that way.  I'm not at all certain about production WYSIWYG consumers
and producers, especially ones implemented to harmonize between OOXML, ODF and other interesting
formats (EPUB coming to mind).

I will watch Peter Kelly's efforts with great interest to see how much the boundaries can
be moved in this area.

From: Peter Kelly [] 
Sent: Monday, August 4, 2014 01:27
Subject: Re: OOXML

On 4 Aug 2014, at 12:16 am, jan i <> wrote:

It's possible in theory, though I'm not familiar enough with the OO codebase to say whether
it would work in practice.

The key idea is to maintain two separate data structures - one which is the ODF XML trees,
and another which is the internal representation. Any time a change gets made to the former,
the implementation must update the latter to reflect the change. Modification operations on
the latter would need to go in the other direction.
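
As a rough illustration of the two-structure idea, here is a minimal Python sketch.  It is an assumption-laden toy, not how OpenOffice or UX Write is built: the class name, the method names, and the flat list-of-strings "internal representation" are all hypothetical, and the XML is a generic <doc>/<p> tree rather than real ODF.

```python
import xml.etree.ElementTree as ET


class SyncedDocument:
    """Toy pairing of an XML tree with a parallel internal model (hypothetical)."""

    def __init__(self, xml_text):
        self.root = ET.fromstring(xml_text)
        self.paragraphs = []
        self._rebuild_model()

    def _rebuild_model(self):
        # Tree -> model: re-derive the internal form from the XML.
        self.paragraphs = [p.text or "" for p in self.root.iter("p")]

    def set_xml_text(self, index, text):
        # A change made to the tree must be reflected in the model.
        list(self.root.iter("p"))[index].text = text
        self._rebuild_model()

    def set_model_text(self, index, text):
        # A change made to the model is pushed back into the tree.
        self.paragraphs[index] = text
        list(self.root.iter("p"))[index].text = text

    def serialize(self):
        return ET.tostring(self.root, encoding="unicode")


doc = SyncedDocument("<doc><p>one</p><p>two</p></doc>")
doc.set_xml_text(0, "uno")
doc.set_model_text(1, "dos")
print(doc.paragraphs)
print(doc.serialize())
```

The sketch rebuilds the whole model after every tree change, which is the simplest way to stay consistent; a real implementation would presumably need incremental updates to cope with large documents.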

In the case of UX Write, there are a few instances where I've used custom extensions to handle
certain things. The main ones are:

1. Table of contents/list of tables/list of figures.

When you insert one of these into your document, UX Write inserts a <nav> element with a CSS
class name of "tableofcontents", "listoffigures", or "listoftables"; these names were chosen
because they are the same keywords that LaTeX uses for these features. UX Write treats these
elements as having special meaning: when a document is opened (and whenever it is modified),
it updates the content of these <nav> elements based on the set of all heading, figure,
or table elements in the document (including numbering/captions).
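
A minimal sketch of that kind of regeneration, in Python, might look like the following.  This is not UX Write's code: the function name, the flat numbering, the restriction to h1-h3, and the use of <p> elements for entries are all assumptions made for illustration, and the input is assumed to be well-formed XHTML without namespaces.

```python
import xml.etree.ElementTree as ET

HEADING_TAGS = ("h1", "h2", "h3")  # assumed subset of heading levels


def update_toc(root):
    """Rebuild every <nav class="tableofcontents"> from the document's headings."""
    headings = [el for el in root.iter() if el.tag in HEADING_TAGS]
    # Collect the targets first so the tree is not modified while iterating it.
    navs = [el for el in root.iter("nav") if el.get("class") == "tableofcontents"]
    for nav in navs:
        # Discard the stale entries.
        for child in list(nav):
            nav.remove(child)
        nav.text = None
        # One entry per heading, with simple flat numbering.
        for number, heading in enumerate(headings, 1):
            entry = ET.SubElement(nav, "p")
            entry.text = f"{number}. {''.join(heading.itertext())}"


page = ET.fromstring(
    '<body><nav class="tableofcontents"><p>stale</p></nav>'
    "<h1>Intro</h1><h2>Scope</h2></body>"
)
update_toc(page)
print(ET.tostring(page, encoding="unicode"))
```

The same pattern would apply to "listoffigures" and "listoftables", collecting figure or table captions instead of headings.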

2. OOXML-specific features.

When converting from .docx to .html during the process of opening a document, it assigns certain
pre-defined CSS class names to particular types of HTML elements to indicate their purpose.
For example, a cross-reference whose display format is supposed to include both the label
and caption of a figure will be translated as:

