openoffice-dev mailing list archives

From Peter Kelly <kelly...@gmail.com>
Subject Re: OOXML
Date Mon, 04 Aug 2014 08:27:15 GMT
On 4 Aug 2014, at 12:16 am, jan i <jani@apache.org> wrote:

> By painful experience, I found out that our internal (memory) structure is
> a superset of mixed ODF and pre-ODF items. I don't think you can have a pure
> odf/OOXML memory structure, you need internal pointers as well (like
> start/finish of copy buffer)...but of course those 2 parts should have been
> well separated.

It's possible in theory, though I'm not familiar enough with the OO codebase to say whether
it would work in practice.

The key idea is to maintain two separate data structures - one which is the ODF XML trees,
and another which is the internal representation. Any time a change gets made to the former,
the implementation must update the latter to reflect the change. Modification operations on
the latter would need to go in the other direction.

This is how WebKit works (well, at least how it worked last time I touched the code, which
was more than 10 years ago...). There is the DOM tree and the rendering tree. The DOM tree
stores the HTML structure exactly as it was parsed from the original file; this is accessible
to JavaScript code and can be modified in arbitrary ways. Whenever the DOM tree changes, WebKit
updates its rendering tree, based both on the DOM tree and applicable rules from the CSS stylesheet.
The rendering tree is the internal model which is used for displaying the content on screen.

Importantly, the DOM tree is also allowed to contain arbitrary XML elements in any namespace.
This is how WebODF works; it includes the content.xml from the package directly, and that's
the "authoritative" data structure that is manipulated during editing. The CSS rules WebODF
uses control rendering of the content.
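
The two-structure arrangement described above can be sketched in a few lines. This is only an illustration, not how WebKit, WebODF, or OpenOffice implement it: the SyncedDocument class, its method names, and the "list of paragraph strings" internal model are all hypothetical, and the model is rebuilt wholesale after each change rather than updated incrementally.

```python
import xml.etree.ElementTree as ET


class SyncedDocument:
    """Hypothetical holder of an XML tree plus a derived internal model."""

    def __init__(self, xml_text):
        self.root = ET.fromstring(xml_text)  # the XML tree as parsed
        self.paragraphs = []                 # stand-in for an internal representation
        self._rebuild_model()

    def _rebuild_model(self):
        # Derive the internal model from the tree (here: just the paragraph texts).
        self.paragraphs = ["".join(p.itertext()) for p in self.root.iter("p")]

    def set_paragraph_xml(self, index, text):
        # A change made to the XML tree; the model is refreshed to reflect it.
        list(self.root.iter("p"))[index].text = text
        self._rebuild_model()

    def set_paragraph_model(self, index, text):
        # A change made to the internal model; pushed back into the XML tree.
        self.paragraphs[index] = text
        para = list(self.root.iter("p"))[index]
        for child in list(para):
            para.remove(child)
        para.text = text


doc = SyncedDocument("<body><p>one</p><p>two</p></body>")
doc.set_paragraph_xml(0, "ONE")    # edit via the XML tree
doc.set_paragraph_model(1, "TWO")  # edit via the internal model
print(doc.paragraphs)
print(ET.tostring(doc.root, encoding="unicode"))
```

Whichever side is edited, both structures agree afterwards; a real editor would replace the full rebuild with targeted updates.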

> I wonder, you wrote earlier that UXwrite uses html internally, that seems
> to me like the lowest common denominator...I would have thought a real
> superset would have been the better choice ?

Well a convenient thing about HTML is that you can include your extensions without affecting
the rendered output, or risking loss of the data. This includes custom elements, custom attributes,
and CSS style names that you may choose to assign special meaning to.

The reasons for this are largely due to the way in which HTML has historically evolved...
browsers deliberately allow the presence of "invalid" elements they don't know about, to cater
for future versions of the spec which add new elements. The idea is "graceful degradation",
such that if you try to view a site that uses some new HTML features your browser doesn't
support, it should at least in theory still let you see most of the content, just that you
won't be able to use the new features. Depending on the HTML/CSS design, this works better
in practice on some sites than on others. Then of course there are JavaScript APIs which can
cause compatibility issues, though that's a separate topic, and the browser will usually at
least display the content even if it can't do dynamic stuff because the JS code threw an exception.

In the case of UX Write, there are a few instances where I've used custom extensions to handle
certain things. The main ones are:

1. Table of contents/list of tables/list of figures.

When you insert one of these into your document, it inserts a <nav> element with a CSS
class name of "tableofcontents", "listoffigures", or "listoftables", which were chosen as
these are the same keywords that LaTeX uses for these features. UX Write treats these as having
special meaning, in the sense that when opening a document (and when the document is modified),
it updates the content of these <nav> elements based on the set of all heading, figure,
or table elements in the document (including numbering/captions).
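
A rough sketch of that update step, using Python's ElementTree on well-formed markup rather than the app's actual JavaScript. The update_toc function, the flat numbering, and the use of one <p> per entry are assumptions made for illustration; the message does not say what the generated content inside the <nav> looks like.

```python
import xml.etree.ElementTree as ET

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


def update_toc(root):
    """Rebuild every <nav class="tableofcontents"> from the document's headings."""
    headings = [el for el in root.iter() if el.tag in HEADINGS]
    for nav in list(root.iter("nav")):
        if nav.get("class") != "tableofcontents":
            continue
        # Discard stale entries, then add one entry per heading.
        for child in list(nav):
            nav.remove(child)
        nav.text = None
        for number, heading in enumerate(headings, start=1):
            entry = ET.SubElement(nav, "p")  # hypothetical entry markup
            entry.text = f"{number} " + "".join(heading.itertext())


html = ET.fromstring(
    '<body><nav class="tableofcontents"><p>stale</p></nav>'
    "<h1>Intro</h1><p>Text</p><h1>Method</h1></body>"
)
update_toc(html)
print(ET.tostring(html, encoding="unicode"))
```

The same pattern would apply to "listoffigures" and "listoftables", collecting figure or table elements instead of headings.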

2. OOXML-specific features.

When converting from .docx to .html during the process of opening a document, it assigns certain
pre-defined CSS class names to particular types of HTML elements to indicate their purpose.
For example, a cross-reference whose display format is supposed to include both the label
and caption of a figure will be translated as:

<a href="#idN" class="uxwrite-ref-label-num">...</a>

where N is the id of the target. The editing code knows about these class names and uses them
to update the text inside the <a> element if the figure number or caption changes. Similarly,
where there is an unsupported object, like an embedded spreadsheet, it will translate this
as:

<span class="uxwrite-placeholder">[Unsupported object]</span>

During editing, WebKit preserves these, since they're just CSS class names and don't in any
way cause problems with the HTML or rendering. All of the core editing operations are implemented
in JavaScript, and these take the class names into account where appropriate.

3. Element mappings for bidirectional transformation.

For every HTML element that is generated from an OOXML element, it sets the id attribute to
a string of the form bdt(N)-(M), where N is a randomly-generated number for each editing session,
and M is the sequence number of the element in the OOXML tree. The purpose of the randomly-generated
N value is to ensure that there aren't mixups for BDT updates if that HTML content gets copied
and pasted into another document within UX Write itself. The number used for the M value
is the position of the element in a pre-order traversal of the XML tree of document.xml. In
cases where the element corresponds to an XML file in the package that is *not* the main content
(currently only for the case of footnotes and endnotes), it is prefixed with a string identifying
the file, so it can be properly identified.

When a document is saved, and the BDT update process takes place, it uses these to re-establish
the relationship between elements in the HTML file and elements in the OOXML content tree,
and figure out where changes have taken place. Given this mapping, it is able to update the
OOXML file based on content from the HTML file.
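
To make the id scheme concrete, here is a toy version in Python. It is a sketch under several assumptions, not the app's code: the parentheses in bdt(N)-(M) are read as placeholders (so ids look like bdt123-5), the XML is a namespace-free stand-in for document.xml, the to_html and apply_changes functions are hypothetical, and only text edits are propagated (no insertions or deletions).

```python
import random
import re
import xml.etree.ElementTree as ET

ID_PATTERN = re.compile(r"^bdt(\d+)-(\d+)$")


def to_html(ooxml_root, session):
    """Toy conversion: one HTML <p> per source <p>, tagged with its mapping id."""
    body = ET.Element("body")
    # Element.iter() visits elements in pre-order, so seq is the pre-order position.
    for seq, el in enumerate(ooxml_root.iter()):
        if el.tag == "p":
            para = ET.SubElement(body, "p", id=f"bdt{session}-{seq}")
            para.text = "".join(el.itertext())
    return body


def apply_changes(html_body, ooxml_root, session):
    """Copy edited paragraph text back to the source elements it came from."""
    by_seq = dict(enumerate(ooxml_root.iter()))
    for para in html_body.iter("p"):
        match = ID_PATTERN.match(para.get("id", ""))
        if match is None or int(match.group(1)) != session:
            continue  # no mapping id, or content pasted from another session
        target = by_seq.get(int(match.group(2)))
        text_el = target.find(".//t") if target is not None else None
        if text_el is not None:
            text_el.text = "".join(para.itertext())


session = random.randrange(1, 10**9)  # the per-session N value
ooxml = ET.fromstring(
    "<document><body><p><r><t>Hello</t></r></p>"
    "<p><r><t>World</t></r></p></body></document>"
)
html = to_html(ooxml, session)
html[1].text = "Everyone"  # simulate an edit to the second paragraph
pasted = ET.SubElement(html, "p", id=f"bdt{session + 1}-2")  # id from another session
pasted.text = "Pasted"
apply_changes(html, ooxml, session)
print(ET.tostring(ooxml, encoding="unicode"))
```

The pasted paragraph carries a different N, so it is not mistaken for an edit to the first paragraph even though its M value collides.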

This is all fully conformant with the HTML spec, as it allows you to choose whatever values
you want for id attributes. And the editor neither knows nor cares whether the file it's working
with was stored as .html or .docx; what happens on save is entirely separate from what happens
during editing. In the case of HTML, the file is just saved directly, and in the case of .docx,
the BDT process described above occurs. I'll be using exactly this same approach for supporting
.odt files.

4. Extra elements to indicate selection.

The iOS version of WebKit has a broken selection API (or at least did at the time I began
writing the app, which was in the days of iOS 5), so I had to "fake" selections by creating
my own <div> and <span> elements with a light-blue background colour. These
are just regular HTML elements with CSS styling - nothing special about them. The editor keeps
track of which elements in the document are used for faking selections, and these are removed
before save; it's a runtime thing only.
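
The "remove before save" step might look roughly like this. It is a sketch only: it assumes the highlight elements are empty overlays that can be dropped outright, the strip_tracked function and the "sel" class name are made up for the example, and the editor's list of tracked elements is modelled as a plain Python list.

```python
import xml.etree.ElementTree as ET


def strip_tracked(root, tracked):
    """Remove runtime-only elements, keeping the text that follows each one."""
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parents = {child: parent for parent in root.iter() for child in parent}
    for el in tracked:
        parent = parents[el]
        index = list(parent).index(el)
        tail = el.tail or ""
        # Re-attach the trailing text to the previous sibling or to the parent.
        if index == 0:
            parent.text = (parent.text or "") + tail
        else:
            previous = parent[index - 1]
            previous.tail = (previous.tail or "") + tail
        parent.remove(el)


root = ET.fromstring('<body><p>Hello <span class="sel" />world</p></body>')
tracked = [root[0][0]]  # elements the editor created to fake the selection
strip_tracked(root, tracked)
print(ET.tostring(root, encoding="unicode"))
```

Tracking the elements by identity, rather than searching for a class name at save time, avoids touching any user content that happens to use the same styling.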

In addition to all of the above, there are additional data structures maintained by the JavaScript
code for information that isn't possible to represent (or doesn't make sense to represent)
in the HTML structure itself. This includes a list of undo/redo operations, event listeners
for changes to elements that would affect the table of contents/cross-references, an abstract
tree representing the document outline, and so forth. These are all JavaScript objects, but
they are separate from the DOM tree and have no effect on opening and saving a file.
The HTML DOM remains the core data structure used, and WebKit preserves
all the information needed.

> Some parts of AOO use the structure directly, others go through the API,
> that is not very clean, and makes it extremely difficult to test changes in
> the internal memory layout. An application like this (and many other
> similar types), should see the memory as a capsule, with a fixed API around
> it.

Agreed; I think it's important to maintain a separation between the internal data structures
used by the editor and other code (file format loading/saving, automated tests, and plugins),
so that the internal structures can be changed without affecting any of these.

--
Dr. Peter M. Kelly
Founder, UX Productivity
peter@uxproductivity.com
http://www.uxproductivity.com/
http://www.kellypmk.net/

PGP key: http://www.kellypmk.net/pgp-key
(fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966)

