openoffice-dev mailing list archives

From Peter Kelly <kelly...@gmail.com>
Subject Re: OOXML
Date Sun, 03 Aug 2014 11:25:02 GMT
On 3 Aug 2014, at 1:57 am, jan i <jani@apache.org> wrote:

> I too am on peter fast rolling waggon :-) but I am also confused.
> 
> @peter maybe you could explain a couple of things, for non-document
> specialists:
> 
> 1) Following your thought, with biderectional editors. Why would a editor
> have a home format ?

There are two ways to view a format: (1) as a way of encoding information for storage or transmission,
and (2) as an in-memory data structure used by the editor at runtime. In some programs these
are two different things, and in others they are the same. The latter is true of web browsers
- HTML is both the file format and the runtime data model; the W3C DOM APIs can be used to
manipulate the HTML structure directly. I believe this was also true to a large extent with
the binary formats used by older versions of MS Office, for purposes of efficiency [1].

I'm not familiar with the internals of OpenOffice - one thing I'd be very interested to know
is whether it uses ODF for its in-memory representation of the document. Or are the runtime data
structures different from the XML trees that one finds in an ODF package?

> Following your thought to the end, the editor would always save/read in the
> format, and things not supported in the format with be saved as private.

The issue of how to handle features not supported by the format is a tricky one. My initial
view is that those features are best disabled if the user chooses to save in that format (or
alternatively a warning message shown on save), since even if there were private extensions
saved in the foreign format, they won't be supported in other apps, and are not guaranteed
to be preserved (see further below).

> 2) When editing in format foo, one can expect that not all features are
> supported (like e.g. microsoft macros), these are handled as private
> containers.
> 
> But looking at LO there seems to be huge challenges when doing especially
> copy/paste operations ?

Yes, this is a very tricky problem. Even with a simple bidirectional transformation model,
where you have a 1:1 mapping between elements in the concrete document and elements in the
abstract document (concrete = original format, abstract = format used by the editor), it's
not possible to know what should be done for elements that have been copied & pasted.

One approach would be to make the mapping 1:n, where if an element in the abstract (editable)
document is copied & pasted one or more times, then its corresponding element in the concrete
document is also duplicated at save time when the file is updated. However, this can potentially
violate uniqueness constraints, e.g. if the element being copied is supposed to have a unique
identifier, you can't just go making a direct copy of it, as you'd end up with two elements
with the same identifier. That said, if the implementation were aware of such uniqueness constraints
for specific elements it could ensure these are still respected, even if it doesn't support
any other aspects of the element (e.g. editing or rendering).
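
As a rough sketch of the 1:n idea above (not taken from any existing implementation), the
following assumes the concrete document is a flat list of child elements, that the uniqueness
constraint lives in a single attribute called "id", and that the editor reports the final order
as a list of those ids. The names fresh_id and save_with_duplicates are made up for this example.

```python
import copy
import xml.etree.ElementTree as ET


def fresh_id(base, used):
    """Return an identifier derived from `base` that is not yet in `used`."""
    n = 1
    while f"{base}-copy{n}" in used:
        n += 1
    new_id = f"{base}-copy{n}"
    used.add(new_id)
    return new_id


def save_with_duplicates(concrete_root, abstract_order, id_attr="id"):
    """Rebuild the children of concrete_root to match the edited document.

    abstract_order lists concrete ids in the order they appear in the
    abstract (edited) document. An id listed more than once was copied and
    pasted, so its concrete element is cloned under a fresh identifier.
    Only the top-level id attribute is rewritten; nested ids are not handled.
    """
    by_id = {el.get(id_attr): el for el in concrete_root}
    used = set(by_id)
    seen = set()
    children = []
    for ref in abstract_order:
        original = by_id[ref]
        if ref not in seen:
            seen.add(ref)
            children.append(original)        # first occurrence: keep as-is
        else:
            clone = copy.deepcopy(original)  # unsupported content copied untouched
            clone.set(id_attr, fresh_id(ref, used))
            children.append(clone)
    concrete_root[:] = children
    return concrete_root


root = ET.fromstring('<body><shape id="s1" kind="unknown"/><p id="p1">text</p></body>')
save_with_duplicates(root, ["s1", "p1", "s1", "s1"])
print([el.get("id") for el in root])
```

The editor never needs to understand the "shape" element here; it only needs to know which
attribute must stay unique.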

Cut & paste is much easier to handle though, as it's equivalent to a move operation, which
doesn't have any implications for uniqueness constraints.

> 3) If we save private info in .docx, how can be be sure that a microsoft
> editor does not destroy it ?
> 
> Does the standard contain some rules about keeping private information ?

Well, we can never be *completely* sure that a Microsoft editor won't destroy something ;)

Having said that though, there are a couple of provisions for this. One is simply the ability
to include extra files in the package, labeled with a particular namespace. Each OOXML package
contains a "relationship graph", which is a separate data structure from the zip file's directory
hierarchy, and is what OOXML uses to identify "parts" (files) within the package. In principle,
there should be no problem with simply adding an extra part with whatever namespace you like,
and that being preserved. However, this isn't guaranteed if an implementation does an import/export,
since usually any extra information gets lost on import and is no longer there by the time
export occurs.
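
To make the "extra part" idea concrete, here is a minimal sketch using only Python's standard
library. It works on a toy package built in memory rather than a real .docx. The relationship
type URI, the part name "extra/private.xml", and the choice to link the part from the
package-level _rels/.rels file are all assumptions made for illustration; only the two
package namespaces are taken from the Open Packaging Conventions.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"
# Hypothetical relationship type, invented for this example.
EXTRA_REL_TYPE = "http://example.org/relationships/private-data"


def serialize(root, default_ns):
    """Serialize with default_ns as the unprefixed namespace."""
    ET.register_namespace("", default_ns)
    return ET.tostring(root, encoding="UTF-8", xml_declaration=True)


def add_private_part(package_bytes, part_name, part_xml):
    """Return a copy of the package with part_xml added and linked from the root."""
    src = zipfile.ZipFile(io.BytesIO(package_bytes))
    rels = ET.fromstring(src.read("_rels/.rels"))
    types = ET.fromstring(src.read("[Content_Types].xml"))

    # Pick a relationship Id that is not already in use.
    used_ids = {r.get("Id") for r in rels}
    n = 1
    while f"rIdPrivate{n}" in used_ids:
        n += 1
    ET.SubElement(rels, f"{{{REL_NS}}}Relationship",
                  Id=f"rIdPrivate{n}", Type=EXTRA_REL_TYPE, Target=part_name)
    ET.SubElement(types, f"{{{CT_NS}}}Override",
                  PartName="/" + part_name, ContentType="application/xml")

    # Zip entries cannot be edited in place, so copy everything to a new archive.
    out = io.BytesIO()
    with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            if info.filename == "_rels/.rels":
                data = serialize(rels, REL_NS)
            elif info.filename == "[Content_Types].xml":
                data = serialize(types, CT_NS)
            else:
                data = src.read(info.filename)
            dst.writestr(info.filename, data)
        dst.writestr(part_name, part_xml)
    return out.getvalue()


def part_is_preserved(package_bytes, part_name):
    """True if the part is still in the zip and still linked from _rels/.rels."""
    z = zipfile.ZipFile(io.BytesIO(package_bytes))
    rels = ET.fromstring(z.read("_rels/.rels"))
    linked = any(r.get("Target") == part_name for r in rels)
    return part_name in z.namelist() and linked


def make_toy_package():
    """Build a tiny stand-in package (not a valid Word document)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr("[Content_Types].xml",
                   f'<Types xmlns="{CT_NS}"><Default Extension="rels" ContentType='
                   '"application/vnd.openxmlformats-package.relationships+xml"/></Types>')
        z.writestr("_rels/.rels", f'<Relationships xmlns="{REL_NS}"/>')
    return buf.getvalue()


package = add_private_part(make_toy_package(), "extra/private.xml", b"<note/>")
print(part_is_preserved(package, "extra/private.xml"))
```

Running part_is_preserved on a file after another application has opened and saved it is one
way to check whether that application kept the extra part.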

I've just done a test on this in fact, to see how different implementations handle it. I added
an extra XML file to a package, and referenced it from the relationships graph. Under Word
2011 and Word 2013, this file was preserved after modification. Under LibreOffice Writer however,
the file disappeared from the package after a save. I suspect this is due to the file being
imported into either ODF or LibreOffice's own internal data model, and thus the extra information
being missing on save (if any of the LO developers are reading this... perhaps you can comment
here).

Ironically the warning message LO displayed when I tried to save the file was 'This document
may contain formatting or content that cannot be saved in the currently selected file format
"Microsoft Word 2007/2010 XML". Use the default ODF file format to be sure that the document
is saved correctly.' In fact, in this instance, the exact opposite is the case - the information
*could* be saved in OOXML (if it were not previously lost on import), but could *not* be saved
in ODF. I think this is a good example of why bidirectional transformation is so important
for achieving true compatibility - since it means you *don't* lose information on save. The
fact that it works in MS Office is possibly more luck than anything else, since it wouldn't
need to do an import.

The second way in which OOXML caters for foreign extensions is a set of XML elements which
can be used to indicate how a consumer should treat content it doesn't know about. This is
described in part 3 of the spec, "Markup Compatibility and Extensibility (MCE)". Essentially
this provides a way of saying to a consumer "hey, I've got this extra info in a custom format,
and you should use that if you support the particular namespace; otherwise, here's some fallback
content you can use instead". It also lets you say to the consumer "just ignore elements in
this namespace if you don't support it".
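
The choice/fallback behaviour described above can be sketched from the consumer's side roughly
as follows. This is a simplified illustration, not a conforming MCE processor: it treats prefix
declarations as global to the document, ignores the "just ignore this namespace" mechanism,
and drops any text that follows the replaced element. The function names and the
"http://example.org/new-shapes" namespace are made up for the example.

```python
import io
import xml.etree.ElementTree as ET

MC_NS = "http://schemas.openxmlformats.org/markup-compatibility/2006"
ALTERNATE = f"{{{MC_NS}}}AlternateContent"
CHOICE = f"{{{MC_NS}}}Choice"
FALLBACK = f"{{{MC_NS}}}Fallback"


def pick_branch(alt, prefixes, understood):
    """Return the first Choice whose required namespaces are all understood,
    otherwise the Fallback element (or None if there is none)."""
    for choice in alt.findall(CHOICE):
        required = choice.get("Requires", "").split()
        if required and all(prefixes.get(p) in understood for p in required):
            return choice
    return alt.find(FALLBACK)


def flatten(parent, prefixes, understood):
    """Replace each AlternateContent under parent with its selected content."""
    # Walk backwards so earlier indices stay valid after a replacement.
    for index in range(len(parent) - 1, -1, -1):
        child = parent[index]
        flatten(child, prefixes, understood)
        if child.tag == ALTERNATE:
            branch = pick_branch(child, prefixes, understood)
            parent[index:index + 1] = list(branch) if branch is not None else []


def resolve_alternate_content(xml_text, understood):
    """Parse xml_text and resolve AlternateContent for a given consumer."""
    prefixes = {}
    root = None
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, payload in ET.iterparse(source, events=("start-ns", "start")):
        if event == "start-ns":
            prefix, uri = payload
            prefixes[prefix] = uri      # simplification: one global prefix map
        elif root is None:
            root = payload              # first "start" event is the root element
    flatten(root, prefixes, understood)
    return root
```

A consumer that only understands the base namespace ends up with the fallback content, while
one that also understands the extension namespace ends up with the content of the matching choice.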

Unfortunately, however, I don't believe there's any guarantee that these are preserved either.
In the case of UX Write, where there is a piece of content stored in multiple formats, it
just throws away the ones it doesn't support (one of the few cases in which UX Write's .docx
support is not fully bidirectional). This is something I should arguably fix, as potentially
there may be useful information lost. The only instance I've seen it used in practice though
is where there's a new, proprietary feature introduced in a later version of Office; e.g.
in Word 2010 or later if you draw a circle in your document, it will (and I'm not making this
up) store two versions of the circle - one in a special Word 2010 namespace which is not defined
in the OOXML spec, and another representation of the circle in the older VML format (which
for some reason mainly consists of a "o:gfxdata" attribute containing binary data encoded
in base 64 - but hey, at least it's in XML, right? ;)

To summarise, I think that storing private/extension information in a foreign file format
should be considered unreliable, since implementations tend to differ a lot on their support
for this. Therefore, one should do so only if there's no major consequence to losing that information.
It also kind of goes against the idea of having a standard in the first place.

[1] http://www.joelonsoftware.com/items/2008/02/19.html

--
Dr. Peter M. Kelly
kellypmk@gmail.com
http://www.kellypmk.net/

PGP key: http://www.kellypmk.net/pgp-key
(fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966)

